API performance degraded
Incident Report for gridX GmbH
Postmortem

Summary

We rolled out a new backend deployment containing changes to data access code that led to a significant performance degradation, which eventually overloaded the database.

The issue was not caught by the test suites or on staging, as it only manifested under production-level traffic.

This led to cascading failures of other services as an increasing number of requests timed out.

Rolling back the deployment did not immediately have the desired effect: the replicas running the previous version failed to start up, as they could not connect to the still-overloaded database.
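One common mitigation for this kind of startup failure is to retry the initial database connection with exponential backoff instead of exiting. The report does not say whether this was in place; the sketch below is purely illustrative, and all names in it are assumptions:

```python
import time

def connect_with_backoff(connect, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a startup DB connection with exponential backoff.
    `connect` is any zero-argument callable; all names are illustrative."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...

# Simulated overloaded database: rejects the first two connection attempts.
calls = {"n": 0}
def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("database overloaded")
    return "connected"

print(connect_with_backoff(flaky_connect, sleep=lambda s: None))  # connected
```

With such a retry loop, rolled-back replicas would have kept attempting to connect while the database recovered, rather than crash-looping.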

Root Cause

We rolled out a new backend deployment containing changes to several high-frequency SQL queries, which put unexpectedly high load on our database.
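The report does not name the specific query pattern. As a hypothetical illustration of how a data-access change can multiply load on a high-frequency path, the sketch below contrasts an N+1 access pattern with a single joined query, using an in-memory SQLite database and an invented schema:

```python
import sqlite3

# In-memory stand-in for the production database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gateways (id INTEGER PRIMARY KEY);
    CREATE TABLE readings (id INTEGER PRIMARY KEY,
                           gateway_id INTEGER REFERENCES gateways(id),
                           value REAL);
""")
conn.executemany("INSERT INTO gateways (id) VALUES (?)",
                 [(i,) for i in range(100)])
conn.executemany("INSERT INTO readings (gateway_id, value) VALUES (?, ?)",
                 [(i, 1.0) for i in range(100)])

def n_plus_one(conn):
    """One query per gateway: 1 + N round trips to the database."""
    queries = 1
    gateways = conn.execute("SELECT id FROM gateways").fetchall()
    for (gid,) in gateways:
        conn.execute("SELECT value FROM readings WHERE gateway_id = ?", (gid,))
        queries += 1
    return queries

def single_join(conn):
    """One joined query, independent of the number of gateways."""
    conn.execute("""SELECT g.id, r.value
                    FROM gateways g JOIN readings r ON r.gateway_id = g.id""")
    return 1

print(n_plus_one(conn))   # 101 queries for 100 gateways
print(single_join(conn))  # 1 query, regardless of gateway count
```

On a staging dataset both variants feel instant; only at production row counts and request rates does the per-request query multiplication translate into database overload, which matches why the regression was not caught before rollout.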

Resolution

  • Rolled back the deployment containing the problematic code
  • Scaled down replicas to reduce load on the database and let it recover
  • Slowly scaled up backend replicas again over the space of half an hour
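The gradual scale-up in the last step can be sketched as an evenly spaced replica ramp. The function below is an illustration only; the step interval, replica counts, and all names are assumptions, not values from the incident:

```python
import math

def scale_up_steps(current, target, duration_min=30, interval_min=5):
    """Plan a gradual ramp from `current` to `target` replicas over
    `duration_min` minutes, in steps every `interval_min` minutes.
    Returns a list of (minute, replica_count) pairs."""
    steps = duration_min // interval_min
    plan = []
    for i in range(1, steps + 1):
        replicas = current + math.ceil((target - current) * i / steps)
        plan.append((i * interval_min, replicas))
    return plan

# e.g. ramp from 2 back up to 20 replicas over half an hour
print(scale_up_steps(2, 20))
# [(5, 5), (10, 8), (15, 11), (20, 14), (25, 17), (30, 20)]
```

Ramping in small steps gives the recovering database a chance to absorb each increase in connection count and query volume before the next one lands.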

Action Items

  • ✅ Fix problematic SQL queries & roll out updated version again.
  • ✅ Share SQL query analysis techniques that help catch these issues in advance (lessons learned).
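The report does not detail which analysis techniques were shared. One standard technique is inspecting the query plan before shipping a query change, to confirm it uses an index rather than a full table scan. A minimal sketch with SQLite's EXPLAIN QUERY PLAN (production likely runs a different database, where EXPLAIN ANALYZE plays the same role; the schema here is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (gateway_id INTEGER, value REAL)")

def plan(conn, sql):
    # EXPLAIN QUERY PLAN reports whether SQLite scans the table
    # or searches it via an index; row[3] holds the detail text.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT value FROM measurements WHERE gateway_id = 7"
print(plan(conn, query))  # SCAN: full table scan, slow at production row counts

conn.execute("CREATE INDEX idx_gw ON measurements (gateway_id)")
print(plan(conn, query))  # SEARCH ... USING INDEX idx_gw: stays fast
```

Running this check against production-sized tables, not just staging data, is what would have surfaced the regression before rollout.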
Posted Apr 04, 2024 - 10:36 CEST

Resolved
All operations are back to normal now. No data has been lost.
Posted Apr 02, 2024 - 17:23 CEST
Update
Latencies and error rates have stabilized. We're continuing to closely monitor the situation.
Posted Apr 02, 2024 - 16:26 CEST
Update
We scaled down some services to create breathing room for other services to catch up.
Error rates are back to normal levels now. We expect higher than normal latencies for the moment.

We continue to carefully scale up and monitor the situation.
Posted Apr 02, 2024 - 15:58 CEST
Update
We are still seeing parts of the system being affected, as the problematic deployment led to significantly increased DB load.
Posted Apr 02, 2024 - 15:18 CEST
Update
We are still seeing issues with some parts of XENON, e.g. the live view.
Posted Apr 02, 2024 - 15:06 CEST
Update
We identified a problematic deployment, which has since been rolled back.
Error levels are back to normal.

We continue to monitor the system and to investigate the root cause.
Posted Apr 02, 2024 - 14:40 CEST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 02, 2024 - 14:26 CEST
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 02, 2024 - 14:20 CEST
Investigating
We are currently investigating this issue.
Posted Apr 02, 2024 - 14:04 CEST
This incident affected: Public API.