Degraded HTTP API Performance
Incident Report for gridX GmbH
Postmortem

Summary

Load on the HTTP API increased sharply and unexpectedly during a deployment of a separate internal service. As a result, the HTTP API was degraded for about one hour and completely unavailable for five minutes.

The incident was mitigated by rolling back the update and scaling the main HTTP API backend up and out.

Root Cause

We updated an internal service, which changed its access patterns against our HTTP API. This exposed an inefficiency in a single endpoint that had not been visible under lower load.

Resolution

  • Rolled back the deployment containing the update
  • Scaled HTTP API replicas up and out (see the sketch below)
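
For illustration only, here is a minimal sketch of what "scaling out" could look like, assuming the HTTP API runs as a Kubernetes Deployment managed via client-go. The Deployment name `http-api`, the `default` namespace, and the replica target are assumptions, not details from this incident.

```go
// Sketch: raise the replica count of a Deployment via the scale subresource.
// All names and numbers are illustrative assumptions.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes this runs inside the cluster
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	deployments := client.AppsV1().Deployments("default") // hypothetical namespace

	// Read the current scale subresource and raise the replica count.
	scale, err := deployments.GetScale(ctx, "http-api", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	scale.Spec.Replicas = 12 // illustrative target; pick based on observed load
	if _, err := deployments.UpdateScale(ctx, "http-api", scale, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("http-api scaled out to 12 replicas")
}
```

Scaling "up" (more CPU/memory per replica) would be a similar change to the Deployment's resource requests and limits; the exact values used during the incident are not part of this report.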

Action Items

  • Investigate and fix the inefficient endpoint; run the HTTP API with additional resources until this is done
  • ✅ Roll out the changes that triggered the incident again, but shift traffic more gradually (see the sketch after this list)
  • ✅ Increase collaboration during rollouts that might impact other teams
  • Provide dedicated internal services for specialized use cases to reduce blast radius
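
As a rough illustration of "shift traffic more gradually": one possible mechanism is a weighted split that ramps the share of requests going to the new version in steps while watching latency and error rates. The sketch below is an assumption, not the mechanism actually used; the backend URLs, percentages, and hold times are placeholders.

```go
// Sketch: gradual traffic shifting between a stable and a canary backend.
// Hypothetical: backend URLs, ramp schedule, and listen port are illustrative only.
package main

import (
	"log"
	"math/rand"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
	"time"
)

func mustProxy(raw string) *httputil.ReverseProxy {
	u, err := url.Parse(raw)
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	stable := mustProxy("http://api-stable.internal:8080") // hypothetical
	canary := mustProxy("http://api-canary.internal:8080") // hypothetical

	// Percentage of traffic sent to the canary, ramped over time.
	var canaryPercent atomic.Int64

	go func() {
		for _, p := range []int64{1, 5, 10, 25, 50, 100} {
			canaryPercent.Store(p)
			log.Printf("canary traffic: %d%%", p)
			time.Sleep(10 * time.Minute) // hold each step; watch latency and error rates
		}
	}()

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if rand.Int63n(100) < canaryPercent.Load() {
			canary.ServeHTTP(w, r)
			return
		}
		stable.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

In practice this kind of split is more likely handled by weighted routing in the ingress or service mesh than by a hand-rolled proxy; the sketch only shows the principle of ramping traffic in observable steps.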
Posted Apr 17, 2024 - 09:18 CEST

Resolved
This incident has been resolved.
Posted Apr 12, 2024 - 17:45 CEST
Monitoring
The deployment of an internal component unexpectedly caused additional load on the parts serving the HTTP API.
The deployment was rolled back and the HTTP API replicas were scaled up and out.
Posted Apr 12, 2024 - 17:31 CEST
Update
We're scaling up our HTTP API serving components to cope with increased load.
The root cause is not yet clear.
The system is stabilizing.
Posted Apr 12, 2024 - 17:16 CEST
Investigating
We are currently investigating this issue.
Posted Apr 12, 2024 - 16:49 CEST
This incident affected: Public API.