Kubernetes at 3am: A Post-Mortem
A single ConfigMap change cascaded into 6 hours of incident response, two rollbacks, and one very long retrospective. Here's the full story — including what we changed so it can't happen again.
Kubernetes at 3am: A Post-Mortem
This is a placeholder. Replace with your actual content.
Timeline
- 02:47 — PagerDuty fires. Latency P99 spikes to 45 seconds. - 03:01 — First engineer online. Pods look healthy. Confusion sets in. - 03:24 — ConfigMap change identified as root cause. - 05:12 — Full rollback complete. Service stable. - 08:30 — Retrospective begins.
Root Cause
A ConfigMap update changed a connection pool size from 50 to 5. The change looked innocuous in review. Under load, the reduced pool caused a cascade of timeouts.
What We Changed
1. ConfigMap changes now require a load test in staging 2. Connection pool size is now an alarm threshold in Grafana 3. Rollback runbooks are now part of every deploy