October 8, 2024 · 14 min read

Kubernetes at 3am: A Post-Mortem

A single ConfigMap change cascaded into 6 hours of incident response, two rollbacks, and one very long retrospective. Here's the full story — including what we changed so it can't happen again.

devops · kubernetes · incident-response · reliability



Timeline

- 02:47 - PagerDuty fires. Latency P99 spikes to 45 seconds.
- 03:01 - First engineer online. Pods look healthy. Confusion sets in.
- 03:24 - ConfigMap change identified as root cause.
- 05:12 - Full rollback complete. Service stable.
- 08:30 - Retrospective begins.

Root Cause

A ConfigMap update cut a connection pool size from 50 to 5, a tenfold reduction. The change looked innocuous in review. Under load, the smaller pool could not keep up, and the result was a cascade of timeouts.
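
To see why a tenfold smaller pool is enough to cause this, a rough capacity estimate helps. The sketch below applies Little's law: a pool of N connections, each held for S seconds per request, can serve at most N / S requests per second. The 50 ms hold time and the 400 requests/second load are illustrative assumptions, not measurements from the incident. Only the pool sizes 50 and 5 come from the change itself.

```python
def max_throughput(pool_size: int, hold_time_ms: float) -> float:
    """Upper bound on requests/second a pool can serve (Little's law)."""
    return pool_size * 1000.0 / hold_time_ms


def backlog_after(seconds: float, arrival_rps: float,
                  pool_size: int, hold_time_ms: float) -> float:
    """Requests still waiting for a connection after `seconds` of steady load."""
    surplus_rps = arrival_rps - max_throughput(pool_size, hold_time_ms)
    return max(0.0, surplus_rps * seconds)


if __name__ == "__main__":
    ARRIVAL_RPS = 400.0   # assumed offered load, not from the incident
    HOLD_TIME_MS = 50.0   # assumed time each request holds a connection

    for pool_size in (50, 5):  # before and after the ConfigMap change
        capacity = max_throughput(pool_size, HOLD_TIME_MS)
        queued = backlog_after(60, ARRIVAL_RPS, pool_size, HOLD_TIME_MS)
        print(f"pool={pool_size}: capacity={capacity:.0f} rps, "
              f"backlog after 60s={queued:.0f} requests")
```

With these assumed numbers, a pool of 50 has headroom (1,000 requests/second of capacity against 400 offered), while a pool of 5 tops out at 100 requests/second. Roughly 300 requests join the queue every second, and each one eventually waits past its timeout.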

What We Changed

1. ConfigMap changes now require a load test in staging
2. Connection pool size is now an alarm threshold in Grafana
3. Rollback runbooks are now part of every deploy
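
As one illustration of how a change like this could be caught before it even reaches a staging load test, here is a minimal sketch of a pre-merge check that flags numeric ConfigMap values that change by more than a chosen factor. The function name `risky_changes`, the keys `DB_POOL_SIZE`, `TIMEOUT_S` and `LOG_LEVEL`, and the 2x threshold are all hypothetical. It is an illustration rather than a description of the tooling listed above, and it assumes the old and new `data` sections have already been loaded as string-to-string dictionaries.

```python
def risky_changes(old: dict[str, str], new: dict[str, str],
                  max_ratio: float = 2.0) -> list[str]:
    """Return descriptions of numeric values that grew or shrank by more than max_ratio."""
    flagged = []
    for key in sorted(old.keys() & new.keys()):
        try:
            before, after = float(old[key]), float(new[key])
        except ValueError:
            continue  # non-numeric values are out of scope for this check
        if before == after:
            continue
        low, high = sorted((abs(before), abs(after)))
        # A change to or from zero, or a ratio above the threshold, is flagged.
        if low == 0 or high / low > max_ratio:
            flagged.append(f"{key}: {old[key]} -> {new[key]}")
    return flagged


if __name__ == "__main__":
    # Hypothetical ConfigMap data; only the 50 -> 5 change mirrors the incident.
    old_data = {"DB_POOL_SIZE": "50", "TIMEOUT_S": "30", "LOG_LEVEL": "warn"}
    new_data = {"DB_POOL_SIZE": "5", "TIMEOUT_S": "25", "LOG_LEVEL": "debug"}
    for line in risky_changes(old_data, new_data):
        print("needs load test:", line)
```

A check like this does not replace the load test. It only makes a 50-to-5 edit harder to wave through in review, since the diff gets an explicit flag instead of looking like any other one-character change.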
