HTTP API Outage

Incident Report for gridX GmbH

Postmortem

We rolled out a new backend deployment containing a database migration that had unforeseen side effects and led to excessive locking of a table with high access frequency. These locks eventually led to the database becoming unresponsive.

Although the migration was tested locally, in CI, and on staging, the problem was not caught earlier because it only manifested under production-level traffic.
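
For illustration of why this only bites under real traffic, here is a minimal sketch. The report does not name the database engine, driver, or table; PostgreSQL, psycopg2, and a "measurements" table are assumptions made for this sketch. A blocking DDL statement has to wait for an exclusive lock behind in-flight queries, and every query issued after it then queues behind the waiting DDL, so the hot table effectively stops serving.

    # Illustration only -- database, driver, and table name are assumptions.
    import threading
    import time

    import psycopg2

    DSN = "dbname=test user=postgres"  # hypothetical connection string


    def production_reader():
        # Simulates normal traffic: a transaction that has touched the hot
        # table holds an ACCESS SHARE lock on it until it commits.
        conn = psycopg2.connect(DSN)
        with conn, conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM measurements")
            cur.execute("SELECT pg_sleep(30)")  # keep the transaction open
        conn.close()


    def migration():
        # The migration's DDL needs an ACCESS EXCLUSIVE lock, so it queues
        # behind the reader above -- and every query issued afterwards
        # queues behind the waiting DDL, so the table stops serving.
        time.sleep(1)  # let the reader acquire its lock first
        conn = psycopg2.connect(DSN)
        with conn, conn.cursor() as cur:
            cur.execute("ALTER TABLE measurements ADD COLUMN note text")
        conn.close()


    t1 = threading.Thread(target=production_reader)
    t2 = threading.Thread(target=migration)
    t1.start()
    t2.start()
    t1.join()
    t2.join()

With no concurrent traffic (as in local, CI, and staging runs), the DDL acquires its lock immediately and finishes in milliseconds, which matches why the issue surfaced only in production.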

Summary

A backend database migration unexpectedly locked a table, causing ongoing requests to pile up and never finish.

Aborting the migration, then scaling the backend down and back up again, resolved the issue.

Root Cause

Deployment of a new backend version containing a migration that led to excessive locking under high load.

Resolution

  • Rolled back deployment, aborting the running migration
  • Scaled down replicas completely to let the DB recover
  • Scaled back up once the DB had recovered

Action Items

  • ✅ Improve our response times for similar issues by codifying the resolution into runbooks
  • ✅ Improve alerting to catch similar issues earlier
  • ✅ Update standard operating procedures to include more rigorous checks for migration risks (one possible guard is sketched below)
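
As one example of such a guard (a sketch only, not necessarily the measure that was adopted; PostgreSQL, psycopg2, and the object names are again assumptions), migration DDL can be run with a short lock_timeout and retried with backoff, so it fails fast instead of stalling the hot table:

    # Sketch of one possible guardrail -- assumptions as noted above.
    import time

    import psycopg2
    from psycopg2 import errors

    DSN = "dbname=prod user=migrator"  # hypothetical connection string
    DDL = "ALTER TABLE measurements ADD COLUMN note text"


    def run_ddl_with_lock_timeout(dsn, ddl, attempts=5):
        """Run a migration statement, but give up quickly if the table is busy."""
        for attempt in range(1, attempts + 1):
            conn = psycopg2.connect(dsn)
            try:
                with conn, conn.cursor() as cur:
                    # Abort after 2s of waiting for the lock instead of
                    # queueing all production traffic behind the DDL.
                    cur.execute("SET lock_timeout = '2s'")
                    cur.execute(ddl)
                return  # committed successfully
            except errors.LockNotAvailable:
                # The hot table is busy; back off and retry later.
                time.sleep(2 ** attempt)
            finally:
                conn.close()
        raise RuntimeError("could not acquire the lock; retry in a quieter window")


    if __name__ == "__main__":
        run_ddl_with_lock_timeout(DSN, DDL)
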
Posted Oct 10, 2024 - 10:22 CEST

Resolved

This incident has been resolved.
Posted Oct 09, 2024 - 14:00 CEST

Monitoring

Services are back up at normal scale. We continue to monitor operations.
Posted Oct 09, 2024 - 13:56 CEST

Update

The database has recovered; we're scaling out again.
Posted Oct 09, 2024 - 13:53 CEST

Update

A data migration unexpectedly caused excessive load on a backend database, causing it to lock up. We're redirecting traffic away from it to let it recover.
Posted Oct 09, 2024 - 13:50 CEST

Identified

The issue has been identified and a fix is being implemented.
Posted Oct 09, 2024 - 13:44 CEST
This incident affected: Frontend and Public API.