On 2024-11-19 at approximately 05:00, our data-processing pipeline experienced elevated memory usage and cascading failures in the Kubernetes replicas responsible for streaming writes to Amazon Kinesis.
These failures disrupted our data-processing operations, impacting both the live and historical views.
Our investigation suggests that the issue stemmed from how retries were handled during intermittent write-latency spikes to Amazon Kinesis. Although no shard throughput limits were exceeded, certain replicas experienced degraded performance due to persistent retry behavior and memory buildup. This was further compounded by the AWS SDK's internal retry management, which amplified resource contention on the affected replicas.
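As a rough illustration of this failure mode and one way to bound it, the sketch below caps both the SDK's internal retries and the application-level retries for a boto3 Kinesis producer, resubmitting only the records a batch reports as failed. It is a minimal sketch under assumed settings: the stream name, retry caps, and backoff values are illustrative, not taken from our actual configuration.

```python
import time

import boto3
from botocore.config import Config

# Cap the SDK's own retry loop so a latency spike does not multiply into
# long-lived request attempts that hold buffers on a single replica.
kinesis = boto3.client(
    "kinesis",
    config=Config(retries={"max_attempts": 3, "mode": "standard"}),
)


def put_with_bounded_retries(records, stream_name="example-stream", max_retries=3):
    """Resubmit only the failed entries of a PutRecords batch, with a
    bounded exponential backoff, instead of retrying the whole batch
    indefinitely. Returns any records that still failed so the caller
    can decide whether to drop, spill, or alert."""
    pending = records
    for attempt in range(max_retries + 1):
        response = kinesis.put_records(StreamName=stream_name, Records=pending)
        if response["FailedRecordCount"] == 0:
            return []
        # Kinesis marks failed entries with an ErrorCode; keep only those.
        pending = [
            record
            for record, result in zip(pending, response["Records"])
            if "ErrorCode" in result
        ]
        if attempt < max_retries:
            time.sleep(min(0.1 * (2 ** attempt), 2.0))  # bounded backoff
    return pending
```

Bounding retries at both layers keeps a transient slowdown from turning into the unbounded resubmission and memory buildup described above.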
Our remediation actions resolved the immediate issue and stabilized the pipeline. Going forward, we plan to:
- Improve how we use the AWS SDK, starting with its retry behavior, to mitigate the issues described above.
- Reduce the blast radius of slow processing on live measurements (see the sketch after this list).
- Improve how quickly we can recover from disruptions to our streaming operations.
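For the second item, one option is to bound the per-replica buffer so that a slow Kinesis writer sheds the oldest live measurements (and surfaces a drop counter) instead of accumulating an unbounded backlog in memory. This is a minimal sketch; the class name, capacity, and batch size are assumptions, not our production values.

```python
from collections import deque


class BoundedRecordBuffer:
    """Hypothetical per-replica buffer: when downstream writes slow down,
    the oldest records are dropped and counted rather than allowed to
    grow without bound and exhaust the replica's memory."""

    def __init__(self, max_records: int = 10_000):
        self._buffer: deque = deque()
        self._max_records = max_records
        self.dropped = 0  # in practice, exported as a metric and alerted on

    def add(self, record: bytes) -> None:
        if len(self._buffer) >= self._max_records:
            self._buffer.popleft()  # shed the oldest measurement first
            self.dropped += 1
        self._buffer.append(record)

    def drain(self, batch_size: int = 500) -> list:
        """Hand the writer at most one batch (PutRecords accepts up to
        500 records per call)."""
        batch = []
        while self._buffer and len(batch) < batch_size:
            batch.append(self._buffer.popleft())
        return batch
```

Dropping (or spilling) old live data keeps a single slow replica's memory flat and limits the impact to a bounded gap in the live view rather than a cascading failure.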