Date and Time of Incident: November 4 08:15 PST - 10:51 PST
Nature of Incident: Exhausted Database Connections due to Thawing Expired Workbooks
Services Affected: Flatfile Platform US Region
On November 4, a daily event expired a large number of workbooks. Many of these workbooks had been backed up and moved to cold storage. The expiry event caused them to be thawed first, before being marked as expired and purged. As a result, the number of open database connections grew steadily until database resources were exhausted at approximately 08:15 PST. This caused intermittent service degradation across the platform: API requests failed at random whenever the API was unable to open a database connection.
Flatfile engineering took steps to short-circuit the thawing process for expired workbooks and hardened the thaw process so that moving a large number of workbooks from cold storage to “hot” storage would not spike database resource usage.
During the incident window, some API requests failed at random, leading to unpredictable behavior across the platform.
The data retention policy feature allows a customer to automatically purge workbooks that have not been active for a set number of days. This feature runs on a cron-like schedule to identify and purge workbooks. It was discovered that this conflicted with the cold storage system for workbooks: any workbook that had been backed up and moved to “cold” storage (S3 backup) was first “thawed” and moved into “hot” storage before it could be expired. In the early morning of the 4th, a large number of workbooks in cold storage were identified for expiry, and several thousand workbooks entered the queue to await thaw. Over the course of several hours, the database connection pool became saturated, causing API requests to fail intermittently. Workbooks in the thaw process also failed and were re-enqueued, which exacerbated the connection pool problem.
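The re-enqueue loop described above is a common amplification pattern: a job that fails because the pool is saturated is retried immediately and adds to the saturation. One general way to dampen it is capped exponential backoff with jitter plus an attempt limit. The following is a minimal sketch of that technique, not Flatfile's actual implementation; the function names, the 2-second base, the 300-second cap, and the limit of 5 attempts are all assumptions chosen for illustration.

```python
import random
from typing import Callable


def retry_delay_seconds(
    attempts_made: int,
    base: float = 2.0,    # assumed base delay, illustrative only
    cap: float = 300.0,   # assumed maximum delay, illustrative only
    rng: Callable[[], float] = random.random,
) -> float:
    """Return how long to wait before re-enqueueing a failed job.

    Uses "full jitter": a random delay between 0 and a ceiling that
    doubles with each failed attempt, up to `cap`.
    """
    ceiling = min(cap, base * (2 ** attempts_made))
    return rng() * ceiling


def should_retry(attempts_made: int, max_attempts: int = 5) -> bool:
    """Stop re-enqueueing once a job has used up its attempts (hypothetical limit)."""
    return attempts_made < max_attempts
```

Spreading retries out in time and giving up after a bounded number of attempts keeps failed jobs from competing with live API traffic for the same connections.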
Flatfile engineering implemented a short circuit for expired workbooks to resolve the immediate symptoms. The engineering team then followed up with a series of steps to harden the queue, the database connection pool, and the thaw mechanism.
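The two remediation ideas above, skipping the thaw for workbooks that are about to be purged and limiting how many thaws run at once, can be sketched as follows. This is an illustrative outline under stated assumptions, not the code Flatfile deployed: the `Workbook` fields, `plan_purge`, `ThawLimiter`, and the action names are hypothetical.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Workbook:  # hypothetical shape, for illustration only
    id: str
    in_cold_storage: bool
    expired: bool


def plan_purge(wb: Workbook) -> str:
    """Decide how to handle a workbook during the expiry run."""
    if not wb.expired:
        return "skip"
    if wb.in_cold_storage:
        # Short circuit: remove the cold backup directly, never thaw it.
        return "delete_backup"
    return "purge_hot"


class ThawLimiter:
    """Caps concurrent thaws so a large batch cannot exhaust the pool."""

    def __init__(self, max_concurrent: int) -> None:
        self._sem = asyncio.Semaphore(max_concurrent)
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed, useful for monitoring

    async def run(
        self, thaw: Callable[[Workbook], Awaitable[str]], wb: Workbook
    ) -> str:
        async with self._sem:  # waits here when the cap is reached
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await thaw(wb)
            finally:
                self.in_flight -= 1
```

With a cap in place, thousands of queued thaws drain at a fixed rate instead of each opening a database connection at the same time. The limiter should be constructed inside the running event loop.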
Please be assured that this incident did not compromise the security or integrity of your data. Our commitment to data protection remains a top priority.