Date and Time of Incident: July 3rd, from approximately 7:15pm MT to approximately 7:45pm MT
Nature of Incident: Ephemeral Database CPU Pinning on Australia Regional Environment
Services Affected: All Workbooks and Sheets endpoints
At approximately 7:15pm MT, we received a report from a customer in the Australia region that the API was returning database timeout errors. Upon investigation, we determined that the ephemeral database was experiencing degraded performance due to CPU load. We also investigated the API service in the Australia regional ECS cluster and rolled over the API deployment as a precautionary measure.
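For reference, a rollover of this kind can be forced through the ECS API without changing the task definition. The sketch below uses boto3; the cluster and service names are illustrative placeholders, not the actual identifiers in the Australia regional environment.

    # Sketch: force a fresh deployment of the regional API service on ECS.
    # Cluster and service names are hypothetical placeholders.
    import boto3

    ecs = boto3.client("ecs", region_name="ap-southeast-2")

    response = ecs.update_service(
        cluster="au-regional-cluster",   # placeholder cluster name
        service="api-service",           # placeholder service name
        forceNewDeployment=True,         # recycle tasks without a new task definition
    )
    print(response["service"]["deployments"][0]["status"])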
All Australia Regional platform users were unable to load workbooks or sheets during the incident. This degraded most customer workflows and completely blocked some, since the failing API responses left those workflows unable to proceed. The incident was fully resolved approximately 30 minutes after the initial report.
The root cause was CPU load on the ephemeral database instance. Upon investigation, we discovered that the instance was too small to handle the load placed on it. We also suspect that background load from Postgres autovacuum was occurring at the time, exacerbating the problem.
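As a rough illustration of how autovacuum involvement can be checked after the fact, the snippet below reads Postgres's standard pg_stat_user_tables statistics view for recent autovacuum runs and dead-tuple counts; the connection details are placeholders.

    # Sketch: inspect recent autovacuum activity on the ephemeral database.
    # Connection parameters are placeholders; pg_stat_user_tables is a
    # standard Postgres statistics view.
    import psycopg2

    conn = psycopg2.connect(host="ephemeral-db.internal", dbname="ephemeral", user="readonly")
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT relname, last_autovacuum, autovacuum_count, n_dead_tup
            FROM pg_stat_user_tables
            ORDER BY last_autovacuum DESC NULLS LAST
            LIMIT 20;
            """
        )
        for relname, last_autovacuum, autovacuum_count, n_dead_tup in cur.fetchall():
            print(relname, last_autovacuum, autovacuum_count, n_dead_tup)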
The ephemeral datastore instance was scaled up to a much larger instance size. This relieved the immediate CPU pressure and provides headroom for substantially more application traffic, and we have confirmed that it resolved the problem. Going forward, we will audit our regional deployments to determine whether each ephemeral datastore is appropriately sized for its region's traffic volume.
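If the ephemeral datastore is a managed Postgres instance on RDS (an assumption; the report does not name the hosting service), both the resize and the planned sizing audit can be scripted against the same API. The identifiers, regions, and target instance class below are illustrative.

    # Sketch: scale a managed database instance and list instance classes
    # across regions as input to the sizing audit. Assumes RDS hosting;
    # all identifiers and the target class are illustrative.
    import boto3

    REGIONS = ["ap-southeast-2", "us-west-2", "eu-west-1"]  # hypothetical region list

    def scale_instance(region, instance_id, target_class):
        rds = boto3.client("rds", region_name=region)
        rds.modify_db_instance(
            DBInstanceIdentifier=instance_id,
            DBInstanceClass=target_class,
            ApplyImmediately=True,  # apply now rather than at the maintenance window
        )

    def audit_instance_classes():
        # Print each instance's current class per region for the sizing review.
        for region in REGIONS:
            rds = boto3.client("rds", region_name=region)
            for db in rds.describe_db_instances()["DBInstances"]:
                print(region, db["DBInstanceIdentifier"], db["DBInstanceClass"])

    # Example: scale_instance("ap-southeast-2", "ephemeral-db-au", "db.r6g.2xlarge")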
There was no loss of customer data and no security breach. We have reviewed application and database logs related to the incident and concluded that the database capacity problem was the root cause and that no other parts of the system experienced degradation.