Major - P2 AWS

A database connection leak that triggered query timeouts under traffic

Outage Synopsis:

A background job worker failed to close database connection sockets, exhausting the RDS pool limits and taking down our API for 23 minutes.

## The Incident

**Duration:** 23 minutes of complete API failure.

**Impact:** Client applications unable to query data; user dashboard displayed connection timeout exceptions.

**Root Cause:** Database session connections left open in a celery worker background loop, exhausting the database host max connection limit.

---

### Timeline

**03:00** - Scheduled nightly data sync job starts on background workers.

**03:10** - First alert triggers: DB connection count exceeds 90% threshold.

**03:12** - API endpoints return 500 Internal Server Error: `OperationalError: FATAL: remaining connection slots are reserved for non-replication superuser connections`.

**03:15** - On-call SRE attempts to log in to the database via SSH bastion, but connection attempts time out due to exhaustion.

**03:18** - SRE kills background celery worker instances to force-close active TCP sockets.

**03:21** - Connection count drops to normal levels. SRE restarts workers with limited concurrency.

**03:23** - API returns to healthy status.

---

### What Went Wrong

**1. Leaked Connections in Loop**

The background job was configured to process data arrays, establishing a database connection per item inside the execution loop instead of utilizing a shared context pool:

```python
# The Bug: session is never closed inside loop
for item in dataset:
    session = SessionLocal()
    process_data(session, item)
    # Missing session.close() call
```

**2. Missing Connection Pool Sizing**

The database instance (`db.t3.micro`) had a connection limit of 85. The background worker spun up dozens of threads, exhausting the pool in minutes.
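The sizing problem in item 2 comes down to simple arithmetic: the peak number of connections every client group can open must fit under the database limit. The sketch below illustrates that check. Only the limit of 85 comes from this incident; the process counts, per-process connection counts, and the 3 reserved superuser slots (the PostgreSQL default) are assumptions chosen for illustration, and `ClientGroup` and `check_budget` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ClientGroup:
    """One kind of database client (hypothetical model for this sketch)."""
    name: str
    processes: int
    connections_per_process: int

    @property
    def peak_connections(self) -> int:
        # Worst case: every process holds its full share at once.
        return self.processes * self.connections_per_process


def check_budget(max_connections: int, reserved: int,
                 groups: Iterable[ClientGroup]) -> int:
    """Return remaining headroom; a negative value means the limit can be exceeded."""
    usable = max_connections - reserved
    demand = sum(g.peak_connections for g in groups)
    return usable - demand


# Assumed numbers, not taken from the incident (except the limit of 85).
groups = [
    ClientGroup("api", processes=4, connections_per_process=10),
    ClientGroup("celery-sync", processes=2, connections_per_process=30),
]
headroom = check_budget(max_connections=85, reserved=3, groups=groups)
print(headroom)  # -18: the workers alone can push the database past its limit
```

Running a check like this in CI or at deploy time would flag a concurrency change that no longer fits the instance, before it shows up as a 03:10 alert.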
---

### What We Changed

**Fix 1: Context Managers for DB Sessions**

We modified database query scripts to enforce automatic closing using Python context managers:

```python
# After fix: block ensures connection release
for item in dataset:
    with db_session_scope() as session:
        process_data(session, item)
```

**Fix 2: Connection Pool Proxy**

We integrated PgBouncer between application instances and the database to pool connection allocations and handle spikes gracefully.
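The post does not show how `db_session_scope` from Fix 1 is defined, so the following is only a minimal sketch of what such a helper could look like, not the actual implementation. To stay runnable without a database, it uses a hypothetical `StubSession` class in place of a real session factory such as `SessionLocal`; the commit-on-success and rollback-on-error behavior is likewise an assumption.

```python
from contextlib import contextmanager


class StubSession:
    """Hypothetical stand-in for a real DB session; it only tracks open/closed state."""

    open_count = 0  # sessions currently holding a (pretend) connection

    def __init__(self):
        StubSession.open_count += 1
        self.closed = False

    def commit(self):
        pass

    def rollback(self):
        pass

    def close(self):
        if not self.closed:
            self.closed = True
            StubSession.open_count -= 1


@contextmanager
def db_session_scope(session_factory=StubSession):
    """Yield a session and always close it, even if the body raises."""
    session = session_factory()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()  # runs on success and on failure


def process_data(session, item):
    # Hypothetical workload: fail on one item to exercise the error path.
    if item == "bad":
        raise ValueError(item)


for item in ["a", "bad", "c"]:
    try:
        with db_session_scope() as session:
            process_data(session, item)
    except ValueError:
        pass  # the failed item is skipped; its session was still closed

print(StubSession.open_count)  # 0
```

The `finally` clause is what separates this from the original bug: a session is released whether `process_data` returns or raises, so a bad item cannot leave a connection behind.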

SRE Takeaway

Modern platform reliability depends on proactive bounds configuration. Insufficient CPU/memory parameters, missing timeout thresholds, or lack of auto-healing definitions are the root triggers for over 80% of cluster outages.

