We rolled out a new backend deployment containing changes in data access code that lead to a significant performance degradation that eventually overloaded the database.
It was not caught in the test suites and on staging, as it only manifested with production level traffic.
This led to cascading failures of other services as requests started to time out increasingly.
Rolling back the deployment did not immediately have the desired effect. The replicas with the previous version failed to start up as they could not connect to the - still overloaded - database.
We rolled out a new backend deployment containing changes to selected, high frequency SQL queries which put unexpected high load on our database.