While investigating elevated error rates in the measurement post-processing of certain event types, we found that we were consuming too many Redis connections. This led to retries and eventually to exhaustion of the connection pool: no further connections could be established, and all invocations failed.
The incident had a minor impact. The displayed inverter status, fuse protection status, and EV plug state were wrong during the incident but were corrected automatically once it was resolved. The actual functionality of these features was not affected.
After we deployed a fix, users might have noticed a slight increase in notifications. However, events older than 5 minutes were discarded so as not to overwhelm users with outdated information.
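The 5-minute cutoff can be illustrated with a minimal Python sketch. This is not our production code: the `timestamp` field and the function names are hypothetical and chosen only for illustration, and the sketch assumes each event carries a timezone-aware timestamp.

```python
from datetime import datetime, timedelta, timezone

# Cutoff taken from the rule described above: events older than 5 minutes are dropped.
MAX_EVENT_AGE = timedelta(minutes=5)


def is_fresh(event_time: datetime, now: datetime) -> bool:
    """Return True if the event is at most MAX_EVENT_AGE old."""
    return now - event_time <= MAX_EVENT_AGE


def filter_fresh(events: list[dict], now: datetime) -> list[dict]:
    """Keep only events whose (hypothetical) 'timestamp' field is within the cutoff."""
    return [e for e in events if is_fresh(e["timestamp"], now)]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    backlog = [
        {"id": "recent", "timestamp": now - timedelta(minutes=1)},
        {"id": "stale", "timestamp": now - timedelta(minutes=30)},
    ]
    # Only the recent event would still trigger a notification.
    print([e["id"] for e in filter_fresh(backlog, now)])
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.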
The Redis connection pool was exhausted because there were more Lambda invocations and each ran more slowly. The pool could not recover under load, as connections were consumed faster than they were returned to it. Retried connection attempts exacerbated the issue.
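The dynamics described above can be reproduced with a small toy model. The following Python sketch is a simplified simulation, not a model of our actual system: the pool size, arrival rate, and hold times are made-up numbers, and time advances in abstract ticks. It shows that a pool stays healthy while invocations are short, starts rejecting attempts once each invocation holds its connection longer, and rejects even more when every failed attempt is retried on the next tick.

```python
def simulate(pool_size: int, arrivals_per_tick: int, hold_ticks: int,
             ticks: int, retry_failed: bool) -> int:
    """Return the number of failed connection attempts over the whole run."""
    release_at: list[int] = []  # tick at which each checked-out connection is returned
    pending_retries = 0
    failures = 0
    for tick in range(ticks):
        # Connections whose invocation has finished go back to the pool.
        release_at = [t for t in release_at if t > tick]
        attempts = arrivals_per_tick + pending_retries
        pending_retries = 0
        for _ in range(attempts):
            if len(release_at) < pool_size:
                release_at.append(tick + hold_ticks)  # connection checked out
            else:
                failures += 1  # pool exhausted, attempt fails
                if retry_failed:
                    pending_retries += 1  # retried on the next tick
    return failures


if __name__ == "__main__":
    # Illustrative numbers only.
    print("fast invocations:", simulate(10, 2, 3, 20, retry_failed=False))
    print("slow invocations:", simulate(10, 2, 8, 20, retry_failed=False))
    print("slow + retries:  ", simulate(10, 2, 8, 20, retry_failed=True))
```

In this model the pool is sustainable only while the arrival rate multiplied by the hold time stays at or below the pool size. Once that product exceeds the pool size, immediate retries add to the demand instead of relieving it.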