HTTP API Outage
Incident Report for gridX GmbH
Postmortem

We rolled out a new backend deployment containing a database migration that had unforeseen side effects and led to excessive locking of a heavily accessed table. These locks eventually left the DB unresponsive.

While the migration was tested locally, in CI, and on staging, the problem wasn’t caught earlier because it only manifested under production-level traffic.
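For illustration only, the sketch below shows one statement-level mitigation for this failure mode. It assumes PostgreSQL and psycopg2, and the table and column names are made up; the report does not name the actual database or migration. Setting a lock timeout makes a DDL statement fail fast if it cannot acquire its exclusive lock, instead of queuing indefinitely while production traffic piles up behind it.

    # Hypothetical sketch: run a schema migration with a lock timeout so a
    # statement that cannot acquire its table lock fails fast instead of
    # blocking (and being blocked by) production traffic.
    # Assumes PostgreSQL and psycopg2; table/column names are illustrative only.
    import psycopg2

    conn = psycopg2.connect("dbname=app user=backend")
    conn.autocommit = False

    try:
        with conn.cursor() as cur:
            # Abort the migration if the exclusive lock is not granted within
            # 5 seconds, rather than queuing all readers/writers behind it.
            cur.execute("SET lock_timeout = '5s'")
            cur.execute("ALTER TABLE measurements ADD COLUMN source_id bigint")
        conn.commit()
    except psycopg2.errors.LockNotAvailable:
        conn.rollback()
        print("Could not acquire lock in time; retry in a low-traffic window.")
    finally:
        conn.close()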

Summary

A backend database migration unexpectedly locked a table, which caused ongoing requests to pile up and never finish.

Aborting the migration and scaling the backend in and back out again resolved the issue.

Root Cause

Deployment of a new backend version containing a migration that led to excessive locking under high load.

Resolution

  • Rolled back the deployment, aborting the running migration
  • Scaled replicas down completely to let the DB recover (see the sketch below)
  • Scaled back up once the DB had recovered
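As a rough runbook sketch only: assuming the backend runs as a Kubernetes Deployment (the report does not say how it is deployed) and using the official kubernetes Python client, scaling in and back out could look like the following; the deployment name, namespace, and replica count are placeholders.

    # Hypothetical runbook sketch: scale the backend to zero replicas so the
    # database can recover, then restore the previous replica count.
    # Assumes Kubernetes (not stated in the report) and the official
    # `kubernetes` Python client; names and namespace are illustrative.
    from kubernetes import client, config

    def scale_deployment(name: str, namespace: str, replicas: int) -> None:
        apps = client.AppsV1Api()
        apps.patch_namespaced_deployment_scale(
            name=name,
            namespace=namespace,
            body={"spec": {"replicas": replicas}},
        )

    if __name__ == "__main__":
        config.load_kube_config()  # or load_incluster_config() in-cluster
        scale_deployment("backend", "default", 0)  # let the DB drain its lock queue
        # ... wait for DB metrics (lock waits, connections) to normalize ...
        scale_deployment("backend", "default", 3)  # restore the usual replica count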

Action Items

  • ✅ Improve our response times for similar issues by codifying the resolution into runbooks
  • ✅ Improve alerting to catch similar issues earlier (see the sketch below)
  • ✅ Update standard operating procedures to include more rigorous checks for migration risks
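For the alerting item, one possible check is sketched below, again assuming PostgreSQL and psycopg2: count sessions that have been waiting on locks for longer than a threshold and alert if any are found. The DSN, threshold, and alerting hook are placeholders.

    # Hypothetical alerting sketch: page when queries have been blocked on
    # locks for longer than a threshold. Assumes PostgreSQL and psycopg2.
    import psycopg2

    def count_long_lock_waits(dsn: str, blocked_for: str = "30 seconds") -> int:
        conn = psycopg2.connect(dsn)
        try:
            with conn.cursor() as cur:
                cur.execute(
                    """
                    SELECT count(*)
                    FROM pg_stat_activity
                    WHERE wait_event_type = 'Lock'
                      AND state = 'active'
                      AND now() - query_start > %s::interval
                    """,
                    (blocked_for,),
                )
                return cur.fetchone()[0]
        finally:
            conn.close()

    if __name__ == "__main__":
        blocked = count_long_lock_waits("dbname=app user=monitoring")
        if blocked > 0:
            print(f"ALERT: {blocked} queries blocked on locks for too long")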
Posted Oct 10, 2024 - 10:22 CEST

Resolved
This incident has been resolved.
Posted Oct 09, 2024 - 14:00 CEST
Monitoring
Services are back up at normal scale. We continue to monitor operations.
Posted Oct 09, 2024 - 13:56 CEST
Update
The database has recovered; we're scaling out again.
Posted Oct 09, 2024 - 13:53 CEST
Update
A data migration unexpectedly put excessive load on a backend database, causing it to lock up. We're redirecting traffic away from it to let it recover.
Posted Oct 09, 2024 - 13:50 CEST
Identified
The issue has been identified and a fix is being implemented.
Posted Oct 09, 2024 - 13:44 CEST
This incident affected: Frontend and Public API.