Inverter, Fuse Protection and EV Plug-State status display incorrect
Incident Report for gridX GmbH
Postmortem

Summary

While investigating elevated error rates in the measurement post-processing of certain event types, we found that we were using too many Redis connections. This led to retries and eventually exhausted the connection pool, so no further connections could be established and all invocations failed.

The incident had a minor impact. The displayed inverter status, fuse protection status, and EV plug state were incorrect during the incident but were corrected automatically once it was resolved. The actual functionality of these features was not affected.

After we deployed the fix, users might have noticed a slight increase in notifications. To avoid overwhelming users with outdated information, events older than 5 minutes were discarded.
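
The discard rule can be sketched as a simple age filter. This is a minimal illustration, not the production implementation: the `Event` type, its `occurred_at` field, and the `drop_stale_events` helper are hypothetical names, and only the 5-minute cutoff comes from this report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Cutoff taken from the report; everything else here is an assumed shape.
MAX_EVENT_AGE = timedelta(minutes=5)


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; the real schema is not described here."""
    name: str
    occurred_at: datetime


def drop_stale_events(events, now, max_age=MAX_EVENT_AGE):
    """Return only the events that are at most `max_age` old at time `now`."""
    return [e for e in events if now - e.occurred_at <= max_age]
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the filter deterministic and easy to test.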

Root Cause

The Redis connection pool was exhausted because Lambda invocations became both more numerous and slower. The pool could not recover under load, as connections were consumed faster than they were returned to it. Retried connection attempts exacerbated the issue.
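
A rough way to see why the pool could not recover is Little's law: the average number of connections in use equals the invocation rate multiplied by the time each invocation holds a connection. The sketch below uses made-up numbers (this report gives no figures), and the helper names are hypothetical.

```python
def connections_in_use(invocations_per_second: float, hold_seconds: float) -> float:
    """Little's law: average concurrent connections = arrival rate * hold time."""
    return invocations_per_second * hold_seconds


def pool_is_sufficient(pool_size: int, invocations_per_second: float,
                       hold_seconds: float) -> bool:
    """True if the pool covers the average demand (ignores bursts and retries)."""
    return connections_in_use(invocations_per_second, hold_seconds) <= pool_size


# Illustrative numbers only, not measurements from the incident.
normal = connections_in_use(40, 0.125)   # 40 invocations/s, 125 ms each
degraded = connections_in_use(60, 1.0)   # more invocations, each much slower
```

With these assumed numbers, average demand grows from 5 to 60 connections, so a pool of 20 would be comfortable before and exhausted after. Retries add further arrivals on top of that, which is why they make recovery harder.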

Resolution

  • Reconfigure the database connection pooling and the client's concurrent invocation parameters.
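
One way to reason about these two settings together is a worst-case connection budget. The sketch below assumes that each concurrent invocation holds its own pool, which this report does not state. The function names, the headroom parameter, and all numbers are hypothetical.

```python
def worst_case_connections(max_concurrent_invocations: int,
                           pool_size_per_invocation: int) -> int:
    """Upper bound on open connections if every invocation fills its pool."""
    return max_concurrent_invocations * pool_size_per_invocation


def fits_connection_budget(max_concurrent_invocations: int,
                           pool_size_per_invocation: int,
                           server_connection_limit: int,
                           headroom_percent: int = 20) -> bool:
    """Check the worst case against the server limit, keeping some headroom.

    `headroom_percent` of the limit is reserved for other clients (assumption).
    """
    usable = server_connection_limit * (100 - headroom_percent) // 100
    return worst_case_connections(
        max_concurrent_invocations, pool_size_per_invocation) <= usable
```

For example, 50 concurrent invocations with 4 connections each stay within an assumed limit of 1000, while 300 concurrent invocations would not.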

Action Items

  • Improve alerts and dashboards to catch similar scenarios earlier.
  • Review database connection handling code and make sure connections are released quickly.
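
As an illustration of the second action item, the sketch below shows a pattern that returns a connection as soon as the work is done and fails fast when none is free. `BoundedPool` and `PoolExhausted` are hypothetical stand-ins built on the Python standard library, not a real Redis client or the code under review.

```python
import queue
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when no connection becomes free within the timeout."""


class BoundedPool:
    """Tiny stand-in for a client connection pool (not a real Redis client)."""

    def __init__(self, connections):
        self._free = queue.Queue()
        for conn in connections:
            self._free.put(conn)

    @contextmanager
    def connection(self, timeout=0.1):
        try:
            conn = self._free.get(timeout=timeout)
        except queue.Empty:
            # Fail fast instead of waiting indefinitely for a connection.
            raise PoolExhausted("no free connection") from None
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised.
            self._free.put(conn)

    def free_count(self):
        """Number of connections currently available in the pool."""
        return self._free.qsize()
```

Keeping the `with pool.connection()` block as short as possible, and doing slow post-processing outside of it, reduces how long each invocation holds a connection.
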
Posted Jul 22, 2024 - 12:25 CEST

Resolved
This incident has been resolved.
Posted Jul 16, 2024 - 00:00 CEST