AU Region Loading

Incident Report for Flatfile

Postmortem

Incident Overview

Date and Time of Incident: July 3, 2025, from approximately 7:15pm MT to approximately 7:45pm MT

Nature of Incident: Ephemeral Database CPU Pinning on Australia Regional Environment

Services Affected: All Workbooks and Sheets endpoints

Details of the Incident

At approximately 7:15pm MT, a customer in the Australia region reported that the API was returning a database timeout error. Our investigation determined that the ephemeral database was experiencing degraded performance due to high CPU load. We also investigated the API service in the Australia regional ECS cluster and rolled over the API deployment as a precautionary measure.

Impact Assessment

All Australia Regional platform users were unable to load workbooks or sheets during the incident. This degraded most customer workflows and completely blocked some of them due to API response failures. The incident was fully resolved about 30 minutes after the initial report.

Root Cause

The root cause was high CPU load on the ephemeral database instance. Upon investigation, we discovered that the instance was too small to handle the load placed on it. We also suspect that background load from PostgreSQL autovacuum may have been occurring at the same time, exacerbating the problem.
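
For readers who want to check a similar autovacuum suspicion on their own PostgreSQL instance, one option is to inspect the standard pg_stat_activity view for autovacuum workers. The following is a minimal sketch, not the tooling used during this incident: the helper name autovacuum_workers and the assumption that rows arrive as dictionaries keyed by column name are hypothetical.

```python
# Query that lists current backends (PostgreSQL 10 or later).
# backend_type and query are standard pg_stat_activity columns.
ACTIVITY_SQL = """
SELECT pid, backend_type, state, query, query_start
FROM pg_stat_activity
"""


def autovacuum_workers(rows):
    """Return only the rows that belong to autovacuum workers.

    `rows` is assumed to be a list of dicts keyed by column name
    (a hypothetical shape; adapt to your database driver).
    """
    return [
        row for row in rows
        if row.get("backend_type") == "autovacuum worker"
        or (row.get("query") or "").startswith("autovacuum:")
    ]


# Illustrative, made-up rows standing in for a query result.
sample_rows = [
    {"pid": 101, "backend_type": "client backend", "query": "SELECT 1"},
    {"pid": 202, "backend_type": "autovacuum worker",
     "query": "autovacuum: VACUUM public.example_table"},
    {"pid": 303, "backend_type": "client backend", "query": None},
]
busy = autovacuum_workers(sample_rows)
```

If the filtered list is non-empty while CPU is saturated, autovacuum is at least a contributing candidate. It does not by itself prove causation.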

Resolution

The ephemeral datastore instance was scaled up to a much larger instance size. This relieved the CPU load and leaves headroom for considerably more traffic from application users. We have confirmed that this resolved the problem. Going forward, we will audit our regional deployments to determine whether ephemeral datastores are sized appropriately for the volume of traffic in each region.
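
As an illustration of what such a sizing audit could look like, the sketch below flags regions whose database CPU samples exceed an average or peak threshold. Everything in it is an assumption made for the example: the RegionCpu type, the undersized function, the thresholds of 60% average and 90% peak, and the sample data are hypothetical and do not describe our actual audit process or instance sizes.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RegionCpu:
    """CPU utilization samples for one region's datastore (hypothetical type)."""
    region: str
    instance_size: str
    cpu_percent_samples: list[float]


def undersized(regions, avg_threshold=60.0, peak_threshold=90.0):
    """Return (region, size, average, peak) for regions over either threshold.

    The thresholds are illustrative defaults, not recommended values.
    """
    flagged = []
    for r in regions:
        if not r.cpu_percent_samples:
            continue  # no data, nothing to judge
        avg = mean(r.cpu_percent_samples)
        peak = max(r.cpu_percent_samples)
        if avg >= avg_threshold or peak >= peak_threshold:
            flagged.append((r.region, r.instance_size, round(avg, 1), peak))
    return flagged


# Made-up sample data for demonstration only.
fleet = [
    RegionCpu("region-a", "small", [55.0, 97.0, 99.0, 98.0]),
    RegionCpu("region-b", "large", [20.0, 35.0, 30.0, 25.0]),
    RegionCpu("region-c", "medium", []),
]
report = undersized(fleet)
```

Checking both average and peak catches instances that are steadily busy as well as those that only saturate during bursts, which is the pattern that produces timeouts.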

Security and Data Integrity

There was no loss of customer data and no security breach. We have reviewed application and database logs related to the incident and concluded that the database capacity problem was the root cause and that no other parts of the system experienced degradation.

Posted Jul 07, 2025 - 08:23 PDT

Resolved

This incident has been resolved.
Posted Jul 03, 2025 - 18:52 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jul 03, 2025 - 18:44 PDT

Identified

The issue has been identified and a fix is being implemented.
Posted Jul 03, 2025 - 18:39 PDT

Investigating

We are currently investigating an issue where our AU regional server is not loading for some customers. We are working to get a fix out as quickly as possible.
Posted Jul 03, 2025 - 18:18 PDT
This incident affected: Australia Regional Platform (AU Regional API, AU Spaces, AU Dashboard).