Major - P2 Networking

How a high DNS TTL value delayed our database migration by 12 hours

Outage Synopsis:

A legacy 86,400-second (24-hour) DNS TTL value caused client applications to continue writing to our old database master after DNS records were switched, delaying our migration.

## The Incident **Duration:** 12 hours of delayed database cutover. **Impact:** Double-writes across old and new database nodes; team forced to write sync scripts to merge split-brain datasets. **Root Cause:** Failure to reduce DNS Time-To-Live (TTL) records in advance of a database host migration. --- ### Timeline **00:00** - Scheduled database maintenance window starts. SRE switches the master database endpoint DNS record from the legacy cluster to the new Aurora database cluster. **00:15** - Maintenance checks reveal that client apps are still querying the legacy database. **00:30** - SRE discovers that legacy DNS records had a TTL configuration of 86,400 seconds (24 hours). Edge ISPs and client devices are caching the old target IP address. **01:00** - SRE attempts to force application restarts, but third-party clients and internal microservices continue to route traffic to the old database. **02:00** - SRE halts the migration. Old and new databases are out of sync, requiring manual sync scripts to merge split-brain data. **12:00** - DNS caches finally expire, completing the database migration cutover. --- ### What Went Wrong **1. Failure to Prune TTL in Advance** DNS TTL records dictate how long systems cache IP resolutions. Changing a DNS target without reducing the TTL in advance causes resolvers to continue caching the old record. **2. Missing DNS Pre-Flight Checklist** The migration checklist did not contain steps to check and reduce the TTL to 300 seconds (5 minutes) one week prior to the migration. --- ### What We Changed **Fix 1: Pre-Migration DNS Checks** We updated our architecture review procedures to enforce a 300-second TTL on all production DNS records. **Fix 2: Database Connection Capping** We configured the legacy database cluster to immediately reject new connections during migrations, forcing clients to reconnect and query DNS records.

SRE Takeaway

Modern platform reliability depends on proactive bounds configuration. Insufficient CPU/memory parameters, missing timeout thresholds, or lack of auto-healing definitions are the root triggers for over 80% of cluster outages.

Want More Outage Stories?

This scenario represents one of the many real-world SRE issues covered in our premium reference manual. Get 156 production-tested scenarios and disaster recovery walkthroughs.

Kubernetes Interview Questions 156 Real Production Scenarios & Architectures

Read Book Scenarios

Related Incidents

Critical - P1

How a high DNS TTL value delayed our database migration by 12 hours

SRE Takeaway

Want More Outage Stories?

Related Incidents

⚙️ How a single pod memory exhaustion crashed our payment system

🔄 A pipeline sync bug that cleared our frontend S3 bucket

☁️ A database connection leak that triggered query timeouts under traffic