Measurement processing
Incident Report for gridX GmbH
Postmortem

On 2024-11-19 at ~05:00, our data processing pipeline experienced elevated memory usage and cascading failures in the Kubernetes replicas responsible for streaming writes to Amazon Kinesis.
These failures disrupted our data processing and affected both the live and historical views.

Root Cause

Our investigation suggests that the issue stemmed from how retries were handled during intermittent write latency spikes to Amazon Kinesis. Although no shard throughput limits were exceeded, specific replicas experienced degraded performance due to persistent retry behavior and memory buildup. This was further compounded by internal retry management within the AWS SDK, which amplified resource contention on affected replicas.
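To illustrate the general failure mode, the following is a minimal sketch of a bounded retry loop with exponential backoff and jitter. It is not our production code: the function name `write_with_bounded_retries`, the `send` callback, and all default values are hypothetical, and the AWS SDK's own retry layer is not modelled here. The idea is that a writer which caps its attempts and hands unsent records back to the caller cannot accumulate an unbounded backlog in memory while downstream latency is elevated.

```python
import random
import time
from typing import Callable, List, Sequence


def write_with_bounded_retries(
    send: Callable[[Sequence[str]], Sequence[str]],
    records: Sequence[str],
    max_attempts: int = 4,      # hypothetical default
    base_delay: float = 0.1,    # seconds, hypothetical default
    max_delay: float = 5.0,     # seconds, hypothetical default
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Try to send records, retrying only the ones that failed.

    `send` is a stand-in for the real stream write: it takes a batch and
    returns the subset of records that was NOT accepted.
    Returns the records still unsent after `max_attempts` attempts, so the
    caller can drop them, spill them to disk, or apply backpressure instead
    of holding them in memory indefinitely.
    """
    pending = list(records)
    for attempt in range(max_attempts):
        pending = list(send(pending))
        if not pending:
            return []
        if attempt < max_attempts - 1:
            # Full jitter: wait a random time up to the capped exponential
            # delay so that replicas do not retry in lockstep.
            upper = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0.0, upper))
    return pending
```

If a client-side loop like this is stacked on top of a library that also retries internally, the effective number of attempts is roughly the product of both limits, which is why the two layers need to be tuned together.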

Resolution

  • Scaled down client-side writes to the gRPC server to reduce system load.
  • Gradually increased server capacity and downstream infrastructure, including adding Kinesis shards.
  • Slowly scaled up client write throughput while monitoring system performance.

These actions resolved the immediate issue and stabilized the pipeline.
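The gradual scale-up can be pictured as a simple ramp controller. The sketch below is illustrative only and does not describe the tooling we used: `ramp_up`, the `healthy` callback, and the step factors are hypothetical names and values. It raises the write rate in steps while a health check passes and backs off when it fails.

```python
from typing import Callable, List


def ramp_up(
    start: float,
    target: float,
    healthy: Callable[[float], bool],
    step_factor: float = 1.5,     # hypothetical growth per healthy step
    backoff_factor: float = 0.5,  # hypothetical reduction per unhealthy step
    max_steps: int = 50,
) -> List[float]:
    """Return the sequence of write rates tried while ramping to `target`.

    `healthy(rate)` is a stand-in for whatever monitoring signal decides
    whether the system is coping at the current rate.
    """
    rate = start
    schedule = [rate]
    for _ in range(max_steps):
        if rate >= target:
            break
        if healthy(rate):
            # Healthy: grow the rate, but never overshoot the target.
            rate = min(target, rate * step_factor)
        else:
            # Unhealthy: shrink the rate, but never drop below the start.
            rate = max(start, rate * backoff_factor)
        schedule.append(rate)
    return schedule
```

For example, `ramp_up(100, 400, lambda r: True, step_factor=2)` steps through 100, 200, and 400 records per second when every health check passes.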

Action Items

  • Improve our usage of the AWS SDK to mitigate the retry and memory issues described above.
  • Reduce the blast radius of slow processing on live measurements.
  • Improve how quickly we can recover from disruptions in our streaming operations.

Posted Nov 19, 2024 - 17:36 CET

Resolved
This incident has been resolved.
Posted Nov 19, 2024 - 15:25 CET
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 19, 2024 - 11:55 CET
Update
We have recovered and are processing the most recent data again. Live data is available, and backfills for historical data are ongoing while we monitor system performance.
Posted Nov 19, 2024 - 10:15 CET
Identified
The issue has been identified and a fix is being implemented.
Posted Nov 19, 2024 - 08:15 CET
Investigating
We are currently experiencing a disruption in our data processing and are investigating.
Posted Nov 19, 2024 - 07:29 CET
This incident affected: Platform Components (beta) (Measurement processing).