Degraded HTTP API Performance
Incident Report for gridX GmbH
Postmortem

Summary

Load on the HTTP API increased sharply and unexpectedly during a deployment of a separate internal service. As a result, the HTTP API was degraded for about one hour and completely unavailable for five minutes.

The incident was mitigated by rolling back the update and scaling the main HTTP API backend up and out.

Root Cause

We updated an internal service, which changed its access patterns against our HTTP API. This exposed an inefficiency in a single endpoint that had not been visible under lower load.

Resolution

  • Rolled back the deployment containing the update
  • Scaled HTTP API replicas up and out (see the sketch below)
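
For illustration only, here is a minimal sketch of what "scaling out" could look like, assuming the HTTP API runs as a Kubernetes Deployment managed via client-go. The Deployment name `http-api`, the `default` namespace, and the replica target are assumptions, not details from this incident.

```go
// Sketch: raise the replica count of a Deployment via the scale subresource.
// All names and numbers are illustrative assumptions.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes this runs inside the cluster
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	deployments := client.AppsV1().Deployments("default") // hypothetical namespace

	// Read the current scale subresource and raise the replica count.
	scale, err := deployments.GetScale(ctx, "http-api", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	scale.Spec.Replicas = 12 // illustrative target; pick based on observed load
	if _, err := deployments.UpdateScale(ctx, "http-api", scale, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("http-api scaled out to 12 replicas")
}
```

Scaling "up" (more CPU/memory per replica) would be a similar change to the Deployment's resource requests and limits; the exact values used during the incident are not part of this report.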

Action Items

  • Investigate and fix the inefficient endpoint; run the HTTP API with additional resources until this is done
  • ✅ Roll out the changes that triggered the incident again, but shift traffic more gradually (see the sketch after this list)
  • ✅ Increase collaboration during rollouts that might impact other teams
  • Provide dedicated internal services for specialized use cases to reduce blast radius
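
As a rough illustration of "shift traffic more gradually": one possible mechanism is a weighted split that ramps the share of requests going to the new version in steps while watching latency and error rates. The sketch below is an assumption, not the mechanism actually used; the backend URLs, percentages, and hold times are placeholders.

```go
// Sketch: gradual traffic shifting between a stable and a canary backend.
// Hypothetical: backend URLs, ramp schedule, and listen port are illustrative only.
package main

import (
	"log"
	"math/rand"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
	"time"
)

func mustProxy(raw string) *httputil.ReverseProxy {
	u, err := url.Parse(raw)
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	stable := mustProxy("http://api-stable.internal:8080") // hypothetical
	canary := mustProxy("http://api-canary.internal:8080") // hypothetical

	// Percentage of traffic sent to the canary, ramped over time.
	var canaryPercent atomic.Int64

	go func() {
		for _, p := range []int64{1, 5, 10, 25, 50, 100} {
			canaryPercent.Store(p)
			log.Printf("canary traffic: %d%%", p)
			time.Sleep(10 * time.Minute) // hold each step; watch latency and error rates
		}
	}()

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if rand.Int63n(100) < canaryPercent.Load() {
			canary.ServeHTTP(w, r)
			return
		}
		stable.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

In practice this kind of split is more likely handled by weighted routing in the ingress or service mesh than by a hand-rolled proxy; the sketch only shows the principle of ramping traffic in observable steps.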
Posted Apr 17, 2024 - 09:18 CEST

Resolved
This incident has been resolved.
Posted Apr 12, 2024 - 17:45 CEST
Monitoring
The deployment of an internal component unexpectedly caused additional load on the parts serving the HTTP API.
The deployment was rolled back and the HTTP API replicas were scaled up and out.
Posted Apr 12, 2024 - 17:31 CEST
Update
We're scaling up our HTTP API serving components to cope with increased load.
The root cause is not yet clear.
The system is stabilizing.
Posted Apr 12, 2024 - 17:16 CEST
Investigating
We are currently investigating this issue.
Posted Apr 12, 2024 - 16:49 CEST
This incident affected: Public API.