On 2024-11-19 at approximately 05:00, our data-processing pipeline experienced elevated memory usage and cascading failures in the Kubernetes replicas responsible for streaming writes to Amazon Kinesis.
These failures disrupted our data-processing operations, impacting both the live and historical views.
Our investigation suggests that the issue stemmed from how retries were handled during intermittent write-latency spikes to Amazon Kinesis. Although no shard throughput limits were exceeded, certain replicas experienced degraded performance due to persistent retry behavior and memory buildup. This was further compounded by the AWS SDK's internal retry management, which amplified resource contention on the affected replicas.
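As a rough illustration of this failure mode and one way to bound it, the sketch below caps both the SDK's internal retries and the application-level retries for a boto3 Kinesis producer, resubmitting only the records a batch reports as failed. It is a minimal sketch under assumed settings: the stream name, retry caps, and backoff values are illustrative, not taken from our actual configuration.

```python
import time

import boto3
from botocore.config import Config

# Cap the SDK's own retry loop so a latency spike does not multiply into
# long-lived request attempts that hold buffers on a single replica.
kinesis = boto3.client(
    "kinesis",
    config=Config(retries={"max_attempts": 3, "mode": "standard"}),
)


def put_with_bounded_retries(records, stream_name="example-stream", max_retries=3):
    """Resubmit only the failed entries of a PutRecords batch, with a
    bounded exponential backoff, instead of retrying the whole batch
    indefinitely. Returns any records that still failed so the caller
    can decide whether to drop, spill, or alert."""
    pending = records
    for attempt in range(max_retries + 1):
        response = kinesis.put_records(StreamName=stream_name, Records=pending)
        if response["FailedRecordCount"] == 0:
            return []
        # Kinesis marks failed entries with an ErrorCode; keep only those.
        pending = [
            record
            for record, result in zip(pending, response["Records"])
            if "ErrorCode" in result
        ]
        if attempt < max_retries:
            time.sleep(min(0.1 * (2 ** attempt), 2.0))  # bounded backoff
    return pending
```

Bounding retries at both layers keeps a transient slowdown from turning into the unbounded resubmission and memory buildup described above.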
Our remediation actions resolved the immediate issue and stabilized the pipeline. Going forward, we plan to:
- Improve how we use the AWS SDK, starting with its retry behavior, to mitigate the issues described above.
- Reduce the blast radius of slow processing on live measurements (see the sketch after this list).
- Improve how quickly we can recover from disruptions to our streaming operations.
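For the second item, one option is to bound the per-replica buffer so that a slow Kinesis writer sheds the oldest live measurements (and surfaces a drop counter) instead of accumulating an unbounded backlog in memory. This is a minimal sketch; the class name, capacity, and batch size are assumptions, not our production values.

```python
from collections import deque


class BoundedRecordBuffer:
    """Hypothetical per-replica buffer: when downstream writes slow down,
    the oldest records are dropped and counted rather than allowed to
    grow without bound and exhaust the replica's memory."""

    def __init__(self, max_records: int = 10_000):
        self._buffer: deque = deque()
        self._max_records = max_records
        self.dropped = 0  # in practice, exported as a metric and alerted on

    def add(self, record: bytes) -> None:
        if len(self._buffer) >= self._max_records:
            self._buffer.popleft()  # shed the oldest measurement first
            self.dropped += 1
        self._buffer.append(record)

    def drain(self, batch_size: int = 500) -> list:
        """Hand the writer at most one batch (PutRecords accepts up to
        500 records per call)."""
        batch = []
        while self._buffer and len(batch) < batch_size:
            batch.append(self._buffer.popleft())
        return batch
```

Dropping (or spilling) old live data keeps a single slow replica's memory flat and limits the impact to a bounded gap in the live view rather than a cascading failure.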