"Something went wrong" errors

Incident Report for Flatfile

Postmortem

Introduction

On March 14, 2025, our team identified an issue where certain workbooks were failing to open and/or update. These failures were caused by a database incident involving one of our ephemeral database servers. This document outlines the incident details, the identified root cause, the steps taken to resolve the issue, and the long-term remediation plan.

Incident Details

  • Date Reported: March 14, 2025
  • Issue Summary: One of Flatfile’s ephemeral database instances (Quickstore 3) entered an abnormal state. Workbooks mounted to this database instance failed to open or update.

Impact Assessment

The incident resulted in degraded service performance for users with workbooks on the Quickstore 3 database. Specifically, users experienced:

  1. Intermittent unavailability of existing workbooks stored on the affected database
  2. Issues loading sheets in newly created spaces that attempted to access data from the affected database

The incident did not affect the creation of new workbooks, as these would be directed to functioning database instances. Only workbooks that were already stored on the Quickstore 3 instance were impacted, leading to a compromised user experience for a subset of users.
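
The difference between new and existing workbooks can be sketched as follows. This is a minimal illustration, not Flatfile's actual routing code: the `DbInstance` type, the instance names, the `healthy` flag, and the `mounts` mapping are all hypothetical and exist only to show why workbooks already mounted to the unhealthy instance failed while new workbooks did not.

```python
from dataclasses import dataclass


@dataclass
class DbInstance:
    name: str       # hypothetical instance identifier
    healthy: bool   # hypothetical health flag, e.g. set by a health check


def instance_for_new_workbook(instances):
    """New workbooks go to the first instance that reports healthy."""
    for inst in instances:
        if inst.healthy:
            return inst
    raise RuntimeError("no healthy database instance available")


def instance_for_existing_workbook(mounts, workbook_id, instances_by_name):
    """Existing workbooks stay pinned to the instance they were mounted on."""
    return instances_by_name[mounts[workbook_id]]


# Hypothetical state during the incident: one unhealthy and one healthy instance.
instances = [
    DbInstance("quickstore-3", healthy=False),
    DbInstance("quickstore-4", healthy=True),
]
by_name = {inst.name: inst for inst in instances}
mounts = {"wb_existing": "quickstore-3"}  # hypothetical workbook-to-instance mapping

new_target = instance_for_new_workbook(instances)
existing_target = instance_for_existing_workbook(mounts, "wb_existing", by_name)
```

In this sketch, `new_target` is the healthy instance, while `existing_target` is still the unhealthy one, which matches the impact pattern described above.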

Root Cause

Initial investigations determined that the Quickstore 3 database had entered an abnormal state. The database writer node became unresponsive, preventing both read and write operations from completing successfully. While the exact trigger for this state is still under investigation, monitoring data suggests that the database instance may have experienced resource exhaustion or an internal failure that was not automatically resolved by the database management system.

Resolution & Fix

  1. Immediate Remediation
     • A backup of the affected database instance was completed to secure all data.
     • A new database instance was brought online to attempt to maintain service availability.
     • A new reader node was spun up, with the plan of removing the problematic node from service.
  2. Recovery Strategy
     • After evaluating options, Flatfile launched a new database cluster from the backup while the new reader node was still coming online, as a fallback in case the additional reader node did not restore the database to a healthy state.

Follow-Up Actions

  • Monitoring Enhancement: Monitoring for this type of issue already exists and the alerts fired correctly. Alert escalation can be improved to prompt a faster response.
  • Root Cause Investigation: Continue the investigation into database monitoring data to determine what initially caused the Quickstore 3 database to enter the problematic state.
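
As one possible shape for the escalation enhancement, the sketch below notifies additional responders the longer an alert stays unacknowledged. It is an assumption-laden illustration only: the tier thresholds, the target names, and the `Alert` type are hypothetical and do not describe Flatfile's alerting setup.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical escalation tiers: (minutes unacknowledged, who gets notified).
ESCALATION_TIERS = [
    (0, "team-channel"),
    (5, "on-call-primary"),
    (15, "on-call-secondary"),
]


@dataclass
class Alert:
    name: str
    fired_at_min: float                          # minute at which the alert fired
    acknowledged_at_min: Optional[float] = None  # minute of acknowledgement, if any


def targets_to_notify(alert, now_min):
    """Return every tier that should have been notified by `now_min`.

    Escalation stops at the moment the alert is acknowledged.
    """
    end = now_min
    if alert.acknowledged_at_min is not None:
        end = min(end, alert.acknowledged_at_min)
    elapsed = end - alert.fired_at_min
    return [target for threshold, target in ESCALATION_TIERS if elapsed >= threshold]


unacked = Alert("db-writer-unresponsive", fired_at_min=0)
acked = Alert("db-writer-unresponsive", fired_at_min=0, acknowledged_at_min=2)

unacked_targets = targets_to_notify(unacked, now_min=6)
acked_targets = targets_to_notify(acked, now_min=20)
```

With these hypothetical tiers, an alert left unacknowledged for six minutes has reached the primary on-call, while one acknowledged after two minutes never escalates beyond the team channel.
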
Posted Mar 17, 2025 - 08:03 PDT

Resolved

This incident has been resolved.
Posted Mar 14, 2025 - 09:03 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Mar 14, 2025 - 08:57 PDT

Identified

The issue has been identified and a fix is being implemented.
Posted Mar 14, 2025 - 08:16 PDT

Investigating

We are seeing some intermittent "something went wrong" errors when trying to load sheets for some users. We are currently investigating this.
Posted Mar 14, 2025 - 07:33 PDT
This incident affected: Flatfile Platform (Spaces).