Intermittent 503 Errors

Incident Report for Flatfile

Postmortem

Introduction

On April 2, 2025, a service degradation caused requests for static assets to fail intermittently. The affected requests included HTML, JS, CSS and other assets, so frontend applications failed to load during several short bursts.

Incident Details

Date Reported: April 2, 2025
Issue Summary: Delivery of frontend application assets degraded

Impact Assessment

The incident resulted in degraded delivery of static assets used in the frontend applications, manifesting in the following:

* Intermittent errors loading Spaces
* Missing assets in applications
* NGINX error pages being displayed instead of Spaces

The incident did not affect usage of the API, nor did it affect browser clients that had already cached the static asset files.

Root Cause

Our cloud hosting provider terminated several EC2 instances in our Kubernetes fleet over several hours on the morning of April 2. Each time, the NGINX proxy that delivers static assets had to be recreated on another node, and requests for assets failed for several seconds while it came back up. This occurred several times in succession.

Resolution & Fix

Immediate Remediation

* Flatfile infrastructure engineers scaled NGINX resources across the fleet to avoid downtime during disruptions

Recovery Strategy

* We implemented a new routing and retry strategy, combined with affinity rules that prevent the NGINX proxy from being scheduled on ephemeral resources
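
To illustrate the retry part of this strategy, here is a minimal sketch of retrying a request that fails with a transient gateway error. It is not Flatfile's implementation: this report does not say where retries are applied (proxy, load balancer, or client), and the function name `fetch_with_retry`, the set of retryable status codes, and the attempt and delay values are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Tuple

# Assumption: statuses treated as transient. The incident itself surfaced as 503s.
RETRYABLE_STATUSES = {502, 503, 504}


def fetch_with_retry(
    fetch: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 4,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call fetch() until it returns a non-retryable status or attempts run out.

    fetch is a hypothetical callable returning (status_code, body).
    The delay doubles after each failed attempt (exponential backoff).
    """
    status, body = fetch()
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE_STATUSES:
            break
        # Wait base_delay, then 2x, then 4x, ... before trying again.
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = fetch()
    return status, body
```

With the default values the caller waits up to 0.2 + 0.4 + 0.8 = 1.4 seconds in total. A production configuration would need a retry budget sized to how long the proxy actually takes to be recreated, which this report describes only as several seconds.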

Follow-Up Actions

Monitoring Enhancement: Monitoring for this type of issue exists and alerts triggered correctly, but alert escalation could be improved to prompt a faster response.

Posted Apr 24, 2025 - 12:12 PDT

Resolved

This incident has been resolved.

Posted Apr 02, 2025 - 12:08 PDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Apr 02, 2025 - 11:07 PDT

Investigating

We are investigating an issue where some users are seeing intermittent 503 errors.

Posted Apr 02, 2025 - 10:53 PDT

This incident affected: Flatfile Platform (Spaces).