On Apr 2, 2025 a service degradation caused intermittent requests for static assets to fail; these included requests for HTML, JS, CSS and other assets resulting in failed delivery of frontend applications for several short bursts of time.
The incident resulted in degraded delivery of static assets used in the frontend applications, manifesting in the following:
The incident did not affect usage of the API and browser clients which had cached the static asset files.
Our cloud hosting provider terminated several EC2 instances in our Kubernetes fleet over several hours the morning of April 2. The NGINX proxy that delivers static assets was forced to recreate on another node, resulting in several seconds of failed requests for assets. This occurred several times in succession.
* Flatifle infrastructure engineers scaled NGINX resources across the fleet to avoid downtime during disruptions
* We implemented new routing and retry strategy combined with affinity rules to prevent scheduling on ephemeral resources