A misconfiguration in our Kubernetes node provisioning system (Karpenter) caused an outage impacting public-facing services from 13:15 to 14:00 CET. Measurement ingestion was unaffected, and no data was lost. During the outage, our local EMS operated in fallback mode, maintaining full local functionality, though cloud optimizations may have been impacted.
Root Cause: A misconfiguration in a core Karpenter resource triggered a cascading issue, disrupting normal cluster operation. Karpenter's dynamic nature complicated immediate manual intervention.
Resolution: The misconfiguration was reverted.
Action Items: