Critical - P1 AWS

How a caching mistake served private user data to anonymous visitors

Outage Synopsis:

A CDN header optimization task removed authorization tokens from the cache key definition. As a result, private profile pages were cached at the edge and served to other visitors for 22 minutes.

## The Incident

**Duration:** 22 minutes of compromised cache state.

**Impact:** Random users viewed active dashboards of other logged-in profiles; potential exposure of private addresses and usernames.

**Root Cause:** CloudFront cache policy optimization stripped the `Authorization` header from the cache key template, caching HTTP responses for user dashboard routes.

---

### Timeline

**10:15** - SRE applies a CloudFront optimization policy to cache static dashboard assets.

**10:18** - First customer reports seeing a different user's profile details upon login.

**10:21** - Customer support receives multiple high-priority tickets indicating account mixups.

**10:24** - SRE team determines that dashboard HTML responses are being cached by edge CDN servers.

**10:28** - SRE disables CDN caching rules on dynamic user paths and issues a global cache invalidation:

```bash
aws cloudfront create-invalidation --distribution-id E23X12345 --paths "/dashboard/*" "/api/user/*"
```

**10:37** - Cache invalidation completes. Users verify their dashboard access is resolved.

---

### What Went Wrong

**1. Caching Dynamic Paths**

The cache policy applied to `/dashboard/*` did not include session identifiers or authentication tokens in its cache key definitions, caching dynamic HTML responses.

**2. Stripping HTTP Headers**

A newly created policy optimized for query performance removed the `Authorization` header from key considerations, treating requests from different users as identical.

---

### What We Changed

**Fix 1: Isolation of Dynamic Assets**

We updated the CDN behavior configuration to block caching on dynamic paths entirely, using policy templates:

```json
"CacheBehavior": {
  "PathPattern": "/dashboard/*",
  "MinTTL": 0,
  "MaxTTL": 0,
  "DefaultTTL": 0,
  "ForwardedValues": {
    "QueryString": true,
    "Headers": ["Authorization", "Host"]
  }
}
```

**Fix 2: Automated CDN Testing**

We implemented pre-deployment checks to verify cache headers (`Cache-Control: private, no-store`) on user endpoints.
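The pre-deployment check described in Fix 2 could look roughly like the sketch below. It is not the team's actual tooling: the function names (`parse_cache_control`, `is_safe_for_private_content`, `find_unsafe_endpoints`) and the `fetch_headers` callback are hypothetical, and the sketch assumes that a user endpoint passes only when its `Cache-Control` header carries both `private` and `no-store`. The header fetching is injected as a callable so the logic can be exercised without network access.

```python
from typing import Callable, Iterable, List, Mapping, Set

# Assumption: both directives must be present for a user endpoint to pass.
REQUIRED_DIRECTIVES = {"private", "no-store"}


def parse_cache_control(value: str) -> Set[str]:
    """Split a Cache-Control header value into lowercase directive names."""
    directives = set()
    for part in value.split(","):
        # Drop any "=value" suffix, e.g. "max-age=0" becomes "max-age".
        name = part.strip().split("=", 1)[0].strip().lower()
        if name:
            directives.add(name)
    return directives


def is_safe_for_private_content(cache_control: str) -> bool:
    """Return True when every required directive is present."""
    return REQUIRED_DIRECTIVES <= parse_cache_control(cache_control)


def find_unsafe_endpoints(
    paths: Iterable[str],
    fetch_headers: Callable[[str], Mapping[str, str]],
) -> List[str]:
    """Return the paths whose responses could be stored by a shared cache.

    fetch_headers is a hypothetical callback that returns the response
    headers for a path (for example, a thin wrapper around an HTTP client).
    """
    unsafe = []
    for path in paths:
        # Header names are case-insensitive, so normalize before lookup.
        headers = {k.lower(): v for k, v in fetch_headers(path).items()}
        if not is_safe_for_private_content(headers.get("cache-control", "")):
            unsafe.append(path)
    return unsafe


if __name__ == "__main__":
    # Canned responses standing in for a staging environment.
    fake_responses = {
        "/dashboard/home": {"Cache-Control": "public, max-age=600"},
        "/api/user/me": {"Cache-Control": "private, no-store"},
    }
    unsafe = find_unsafe_endpoints(fake_responses, fake_responses.__getitem__)
    if unsafe:
        raise SystemExit(f"Unsafe cache headers on: {', '.join(unsafe)}")
```

Run as a CI step, a non-empty result fails the deployment before a permissive cache policy can reach the edge. A missing `Cache-Control` header is treated as unsafe, which errs on the side of blocking the release.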

SRE Takeaway

Modern platform reliability depends on configuring bounds proactively. Insufficient CPU/memory parameters, missing timeout thresholds, or missing auto-healing definitions are the root triggers for over 80% of cluster outages.

Related Incidents

☁️ A database connection leak that triggered query timeouts under traffic

⚙️ How a single pod memory exhaustion crashed our payment system

🔄 A pipeline sync bug that cleared our frontend S3 bucket