Partial/Temporary Outage

Incident Report for gridX GmbH

Postmortem

Due to a large-scale roll-out of gridBoxes, the gateways/…/scans endpoint was called significantly more often than expected. The implementation of this endpoint was suboptimal, so it could not keep up with the load. This led to timeouts and subsequent retries, which further exacerbated the problem.
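
The retry amplification described above is commonly dampened on the caller side with capped exponential backoff and jitter. The following is a minimal sketch of that general technique, not the gridBox implementation: the names backoff_delay and call_with_retries, the defaults (1 s base, 60 s cap, 5 attempts) and the use of TimeoutError are all assumptions made for illustration.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Return a randomized delay in seconds for the given retry attempt.

    The upper bound doubles with every attempt and is capped, and the
    actual delay is drawn uniformly from [0, bound) ("full jitter"), so
    that many clients do not retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retries(request, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Call request() and retry on TimeoutError with jittered backoff.

    request is a hypothetical zero-argument callable standing in for the
    HTTP call. The last timeout is re-raised once attempts are exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, rng=rng))
```

Spreading retries out this way trades latency for a lower peak request rate, which gives an overloaded backend room to recover instead of being hit again immediately by every timed-out caller.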

Resolution

  • Scale in the replicas serving the endpoint in question to relieve pressure on the database.
  • Disable the gateways/…/scans endpoint.
  • Scale out again.
  • Analyze and optimize the suboptimal query used by the endpoint in question.
  • Re-enable the gateways/…/scans endpoint.
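
One way to disable and later re-enable a single endpoint without a redeploy is a runtime switch that short-circuits the handler before it reaches the database. The report does not say how the endpoint was actually disabled, so the sketch below is an assumption throughout: EndpointSwitch, guarded, the "scans" label and the 503 response with a Retry-After header are hypothetical choices, not the gridX code.

```python
class EndpointSwitch:
    """In-memory registry of endpoints that are currently switched off."""

    def __init__(self):
        self._disabled = set()

    def disable(self, name):
        self._disabled.add(name)

    def enable(self, name):
        self._disabled.discard(name)

    def is_enabled(self, name):
        return name not in self._disabled


def guarded(switch, name, handler, retry_after_seconds=300):
    """Wrap handler so it is skipped entirely while the endpoint is disabled.

    While disabled, callers receive 503 Service Unavailable with a
    Retry-After header, and the wrapped handler (and therefore its
    database query) is never invoked.
    """
    def wrapper(*args, **kwargs):
        if not switch.is_enabled(name):
            return 503, {"Retry-After": str(retry_after_seconds)}, b""
        return handler(*args, **kwargs)
    return wrapper
```

Calling switch.disable("scans") would correspond to the second step in the list above and switch.enable("scans") to the last one. In a multi-replica deployment the switch state would need to live in shared configuration rather than process memory.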

Action Items

  • Reduce the polling frequency of the gateways/…/scans endpoint in Phoenix.
  • Optimize the query backing the endpoint in question and re-enable the endpoint.
  • Reconsider how scans are handled and stored.
  • Audit existing queries on large tables to identify potential future bottlenecks.
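
For a polled endpoint, the steady-state request rate is roughly the number of polling clients divided by the polling interval. The sketch below shows only that arithmetic. The function name and the figures in the usage note are hypothetical, since the report states neither the number of gridBoxes nor the actual interval.

```python
def polling_request_rate(client_count, interval_seconds):
    """Approximate steady-state requests per second for a polled endpoint.

    Assumes each client polls once per interval and that polls are spread
    evenly over time. Retries are not included.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return client_count / interval_seconds
```

With a purely illustrative 10,000 clients, moving from a 60 second to a 300 second interval cuts the estimate from about 167 to about 33 requests per second. The same estimate also shows why a large roll-out raises load on the endpoint linearly with the number of devices.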

See also https://status.gridx.de/incidents/9nd1bqyb6bdy

Posted Apr 02, 2025 - 15:40 CEST

Resolved

This incident has been resolved.
Posted Mar 25, 2025 - 21:22 CET

Update

The issue is solved. Systems are back.
Posted Mar 25, 2025 - 20:57 CET

Update

The fix is in place; the Public API and Frontend are back. We are working on fixing a degradation of gridBox measurements.
Posted Mar 25, 2025 - 20:37 CET

Update

We are continuing to work on a fix for this issue.
Posted Mar 25, 2025 - 20:11 CET

Update

We are continuing to work on a fix for this issue.
Posted Mar 25, 2025 - 20:07 CET

Update

We have traced the issue to a single endpoint and are now implementing a fix to short-circuit it.
Posted Mar 25, 2025 - 19:40 CET

Identified

The issue has been identified and a fix is being implemented.
Posted Mar 25, 2025 - 18:37 CET

Update

We are continuing to investigate this issue.
Posted Mar 25, 2025 - 17:54 CET

Investigating

We are currently investigating this issue.
Posted Mar 25, 2025 - 17:36 CET
This incident affected: Frontend, Public API, and Private gridBox API.