Critical - P1 Kubernetes

How a single pod memory exhaustion crashed our payment system

Outage Synopsis:

A memory leak in a new coupon feature caused cascading Out-Of-Memory (OOM) kills across our payment microservices namespace, leading to 12 minutes of complete downtime under load.

## The Incident

**Duration:** 47 minutes of degraded service, 12 minutes of complete outage.

**Impact:** Checkout gateway unresponsive; payment processing failed for ~2,400 active sessions.

**Root Cause:** Memory leak in coupon cache dict + restrictive Pod memory limits + lack of Horizontal Pod Autoscaling (HPA).

---

### Timeline

**14:32** - Automated deployment of v2.4.1 completes. The build contains the new coupon lookup cache. All pod replicas show `Running`.

**14:41** - First warning fires: `payment-service` p99 query latency exceeds 2.5 seconds. The on-call engineer starts checking logs.

**14:43** - Critical alert: API gateway error rate climbs past 5%. Gateway logs show socket timeouts.

**14:44** - A pod status check reveals container terminations:

```bash
kubectl get pods -n payments
# payment-service-5c8f8-xj2p   0/1   OOMKilled   1   3m
# payment-service-5c8f8-ab4c   1/1   Running     0   8m
# payment-service-5c8f8-qw9z   0/1   OOMKilled   2   3m
```

**14:46** - With two pods killed, all incoming traffic is routed to the single remaining pod. The concentrated load spikes its memory usage, triggering a third OOMKill within 90 seconds.

**14:47** - Complete service outage. AWS Application Load Balancer (ALB) health checks fail, and the ALB returns 503 Service Unavailable.

**14:49** - The on-call engineer decides to roll back to v2.4.0 immediately.

```bash
kubectl rollout undo deployment/payment-service -n payments
kubectl rollout status deployment/payment-service -n payments
```

**14:54** - Rollback completes. All pods return to healthy status. Payments resume.

---

### What Went Wrong

**1. Memory Leak in Cache Dict**

The coupon feature cached validation lookups in a module-level dictionary that was never pruned or expired. Because the key combined the coupon code with the user ID, every distinct code and user pair added a new entry holding a full database record:

```python
# The Bug: unbounded dictionary grows without limit in RAM.
# `db` and `Coupon` come from the application's ORM layer (not shown).
_validation_cache = {}

def validate_coupon(code: str, user_id: int) -> bool:
    # One entry per (code, user) pair, and nothing ever removes entries.
    key = f"{code}:{user_id}"
    if key not in _validation_cache:
        # The whole DB record is kept in memory for the life of the process.
        _validation_cache[key] = db.query(Coupon).filter_by(code=code).first()
    return _validation_cache[key].is_valid
```

**2. Restrictive Memory Limits**

The pod memory limit was hardcoded to `256Mi` based on legacy development profiles. Production usage was already averaging `190Mi`, leaving only about `66Mi` of headroom for growth.

**3. Missing Autoscaling**

No HPA was configured. When a container was terminated, the cluster did not scale out to absorb the redistributed traffic.

---

### What We Changed

**Fix 1: Remove the Memory Leak**

The module-level dictionary was removed, and validation lookups now run as a fast direct query. A Redis-backed cache with a strict TTL was added in a subsequent release.

**Fix 2: Set Safe Memory Limits**

We updated the manifest to give each container an adequate memory buffer:

```yaml
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"  # Safe headroom for unexpected heap usage
```

**Fix 3: Implement HPA**

We added a Horizontal Pod Autoscaler that scales on average memory utilization. For `Utilization` targets, the HPA measures usage as a percentage of the container's memory request, not its limit:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-hpa
  namespace: payments
spec:
  scaleTargetRef:  # Required: the workload this HPA scales
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
```
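
Fix 1 mentions a Redis-backed cache with a strict TTL. As a rough in-process illustration of the two properties the original dictionary lacked (a hard size cap and per-entry expiry), the sketch below uses only the Python standard library. `BoundedTTLCache`, `max_entries`, `ttl_seconds`, and the sample coupon codes are hypothetical names chosen for this example; this is not the implementation that was shipped.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable


class BoundedTTLCache:
    """Hypothetical in-process cache with an entry cap and per-entry expiry."""

    def __init__(self, max_entries: int, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expires_at, value); dict order doubles as LRU order.
        self._entries = OrderedDict()

    def get(self, key: Hashable) -> Any:
        """Return the cached value, or None on a miss or an expired entry."""
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it so memory is reclaimed
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: Hashable, value: Any) -> None:
        """Store a value, evicting least recently used entries over the cap."""
        self._entries[key] = (self._clock() + self._ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)

    def __len__(self) -> int:
        return len(self._entries)


# Usage: the cache never holds more than max_entries items.
cache = BoundedTTLCache(max_entries=2, ttl_seconds=300)
cache.put("SAVE10", True)
cache.put("SPRING", False)
cache.put("VIP", True)        # evicts "SAVE10", the least recently used key
print(len(cache))             # 2
print(cache.get("SAVE10"))    # None
```

Keying on the coupon code alone and storing a small value such as a boolean, rather than a full database record per user, would also keep each entry small. Whether that fits depends on how per-user validity is decided, which the post-mortem does not describe.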

SRE Takeaway

Modern platform reliability depends on proactive bounds configuration. Insufficient CPU/memory parameters, missing timeout thresholds, or lack of auto-healing definitions are the root triggers for over 80% of cluster outages.
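
To make "bounds configuration" concrete, the figures from this incident can be checked with a few lines of arithmetic. The sketch below is illustrative only: `utilization_percent` and the constant names are hypothetical, and the numbers are the ones quoted in the post-mortem.

```python
def utilization_percent(usage_mib: float, bound_mib: float) -> float:
    """Return memory usage as a percentage of a request or limit (both in MiB)."""
    if bound_mib <= 0:
        raise ValueError("bound_mib must be positive")
    return 100.0 * usage_mib / bound_mib


AVG_USAGE_MIB = 190       # average production usage before the incident
OLD_LIMIT_MIB = 256       # original hardcoded limit
NEW_REQUEST_MIB = 256     # request after Fix 2
NEW_LIMIT_MIB = 512       # limit after Fix 2
HPA_TARGET_PERCENT = 75   # averageUtilization in the HPA manifest

vs_old_limit = utilization_percent(AVG_USAGE_MIB, OLD_LIMIT_MIB)
vs_new_limit = utilization_percent(AVG_USAGE_MIB, NEW_LIMIT_MIB)
vs_request = utilization_percent(AVG_USAGE_MIB, NEW_REQUEST_MIB)

print(f"{vs_old_limit:.1f}%")  # 74.2%
print(f"{vs_new_limit:.1f}%")  # 37.1%
print(f"{vs_request:.1f}%")    # 74.2%
print(vs_request < HPA_TARGET_PERCENT)  # True
```

One thing this surfaces: at the reported 190Mi baseline, utilization against the 256Mi request already sits just under the 75% HPA target, so a modest rise in memory use would be enough to trigger a scale-out. Whether that is desirable depends on how usage looks after the leak fix, which is not reported here.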

