Production War Stories - Real SRE Post-Mortems

Critical - P1 Kubernetes

How memory exhaustion in a single pod crashed our payment system

A memory leak in a new coupon feature caused cascading out-of-memory (OOM) kills across our payment microservices namespace, leading to 12 minutes of complete downtime under load.

Critical - P1 CI/CD

A pipeline sync bug that cleared our frontend S3 bucket

An incorrect environment variable on a CI build runner caused aws s3 sync --delete to target the production bucket instead of staging, deleting all static web assets in seconds.
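One way to guard against this class of mistake is to validate the target bucket against the declared environment before the sync command is ever built. The following is a minimal sketch, not the pipeline from the incident: the bucket names, the ALLOWED_BUCKETS mapping, and the build_sync_command helper are all hypothetical.

```python
from __future__ import annotations

# Hypothetical mapping of deploy environments to the only bucket each may touch.
ALLOWED_BUCKETS = {
    "staging": "example-frontend-staging",
    "production": "example-frontend-production",
}


def build_sync_command(deploy_env: str, bucket: str, source_dir: str = "./dist") -> list[str]:
    """Return the argv for `aws s3 sync --delete`, or raise if env and bucket disagree."""
    expected = ALLOWED_BUCKETS.get(deploy_env)
    if expected is None:
        raise ValueError(f"unknown deploy environment: {deploy_env!r}")
    if bucket != expected:
        # Fail before anything destructive runs.
        raise ValueError(
            f"refusing to sync: {deploy_env!r} must target {expected!r}, got {bucket!r}"
        )
    return ["aws", "s3", "sync", source_dir, f"s3://{bucket}", "--delete"]
```

The function only builds the argument list, so the check can run (and be unit-tested) without AWS credentials; the caller decides whether to execute the command.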

Major - P2 AWS

A database connection leak that triggered query timeouts under traffic

A background job worker failed to close its database connections, exhausting the RDS connection pool and taking down our API for 23 minutes.
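A common defense against this kind of leak is to tie each connection's lifetime to a scope, so it is closed even when the job raises. The sketch below illustrates the pattern with the standard-library sqlite3 module as a stand-in for the real database driver; run_job and the connect factory are hypothetical names, not code from the incident.

```python
import sqlite3
from contextlib import closing


def run_job(connect, job):
    """Open a connection, run job(conn), and always close the connection.

    `connect` is a zero-argument factory returning a DB-API connection;
    `job` is a callable that receives that connection.
    """
    # closing() calls conn.close() on exit, whether job() returns or raises.
    with closing(connect()) as conn:
        return job(conn)


if __name__ == "__main__":
    result = run_job(
        lambda: sqlite3.connect(":memory:"),
        lambda conn: conn.execute("SELECT 1").fetchone()[0],
    )
    print(result)
```

With a pooled driver the same shape applies, except that leaving the scope would return the connection to the pool rather than close the socket.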

Critical - P1 AWS

How a caching mistake served private user data to anonymous visitors

A CDN header optimization task removed the authorization token from the cache key, so private profile pages were cached and served to unrelated visitors for 22 minutes.
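The failure mode is easier to see in a toy model of cache key construction. This is an illustrative sketch only: the cache_key function and its key format are assumptions, not the CDN's actual configuration. When the credential is part of the key, two users requesting the same path can never share a cached entry.

```python
from __future__ import annotations

import hashlib


def cache_key(path: str, authorization: str | None = None) -> str:
    """Build a cache key that separates anonymous and per-credential responses."""
    if authorization is None:
        return f"anon:{path}"
    # Hash the credential so the raw token never appears in the key.
    token_digest = hashlib.sha256(authorization.encode("utf-8")).hexdigest()
    return f"auth:{token_digest}:{path}"
```

Dropping the authorization component collapses every variant of a path into one key, which is the behavior described above. An alternative that is often simpler is to avoid shared caching of authenticated responses altogether, for example by marking them Cache-Control: private.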

Major - P2 Networking

How a high DNS TTL value delayed our database migration by 12 hours

A legacy 86,400-second (24-hour) DNS TTL value caused client applications to continue writing to our old database master after DNS records were switched, delaying our migration.
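The timing constraint behind this can be worked out with simple arithmetic: a resolver that honors TTLs may keep serving a record for up to the full TTL after it was fetched, so a lowered TTL only takes full effect once the old TTL has elapsed. The sketch below assumes resolvers honor TTLs; the function name and the example date are hypothetical.

```python
from datetime import datetime, timedelta


def earliest_safe_cutover(ttl_lowered_at: datetime, old_ttl_seconds: int) -> datetime:
    """Earliest time at which no TTL-honoring resolver still holds a record
    cached under the old TTL, given when the TTL was lowered."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)


if __name__ == "__main__":
    lowered = datetime(2024, 1, 1, 0, 0)  # hypothetical date
    print(earliest_safe_cutover(lowered, 86_400))  # 2024-01-02 00:00:00
```

With an 86,400-second TTL, the TTL would have to be lowered at least a full day before the record switch for clients to follow the change promptly.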

Major - P2 CI/CD

How an ArgoCD sync loop saturated our cluster CPU and crashed our monitoring

A misconfigured Helm chart created an infinite resource synchronization loop in ArgoCD, causing controller CPU utilization to spike to 100% and taking down our Prometheus instance.