Critical - P1 AWS

How a caching mistake served private user data to anonymous visitors

Outage Synopsis:

A CDN header optimization task stripped authorization tokens from cache key definitions, caching private profile pages and serving them to random visitors for 22 minutes.
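The failure mode in the synopsis can be illustrated with a small in-memory model. This is a simplified sketch and not CloudFront's actual implementation: the `EdgeCache` class, the request dictionaries, and the token values are all hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, Tuple

# Hypothetical request shape for this sketch: {"path": ..., "authorization": ...}
Request = Dict[str, str]


def key_without_auth(req: Request) -> Tuple[str, ...]:
    # Misconfigured key: only the path identifies the cached object.
    return (req["path"],)


def key_with_auth(req: Request) -> Tuple[str, ...]:
    # Safer key: the Authorization value is part of the object's identity.
    return (req["path"], req.get("authorization", ""))


def origin(req: Request) -> str:
    # Stand-in for the application: renders a page for whoever holds the token.
    return f"dashboard for {req.get('authorization', 'anonymous')}"


class EdgeCache:
    """Toy edge cache that stores one response per cache key."""

    def __init__(self, key_fn: Callable[[Request], Tuple[str, ...]]) -> None:
        self.key_fn = key_fn
        self.store: Dict[Tuple[str, ...], str] = {}

    def fetch(self, req: Request) -> str:
        key = self.key_fn(req)
        if key not in self.store:
            # Cache miss: ask the origin and keep the response.
            self.store[key] = origin(req)
        return self.store[key]


alice = {"path": "/dashboard/home", "authorization": "token-alice"}
bob = {"path": "/dashboard/home", "authorization": "token-bob"}

leaky = EdgeCache(key_without_auth)
leaky.fetch(alice)               # Alice's page is cached under the path alone
leaked_page = leaky.fetch(bob)   # Bob hits the same key and receives Alice's page

safe = EdgeCache(key_with_auth)
safe.fetch(alice)
safe_page = safe.fetch(bob)      # Different key, so Bob gets his own page
```

With the path-only key, both requests collapse into one cache entry, which is the behavior customers reported during the incident.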

## The Incident

**Duration:** 22 minutes of compromised cache state.

**Impact:** Random users viewed active dashboards of other logged-in profiles; potential exposure of private addresses and usernames.

**Root Cause:** CloudFront cache policy optimization stripped the `Authorization` header from the cache key template, caching HTTP responses for user dashboard routes.

---

### Timeline

**10:15** - SRE applies a CloudFront optimization policy to cache static dashboard assets.

**10:18** - First customer reports seeing a different user's profile details upon login.

**10:21** - Customer support receives multiple high-priority tickets indicating account mixups.

**10:24** - SRE team determines that dashboard HTML responses are being cached by edge CDN servers.

**10:28** - SRE disables CDN caching rules on dynamic user paths and issues a global cache invalidation:

```bash
aws cloudfront create-invalidation --distribution-id E23X12345 --paths "/dashboard/*" "/api/user/*"
```

**10:37** - Cache invalidation completes. Users verify their dashboard access is resolved.

---

### What Went Wrong

**1. Caching Dynamic Paths**

The cache policy applied to `/dashboard/*` did not include session identifiers or authentication tokens in its cache key definitions, caching dynamic HTML responses.

**2. Stripping HTTP Headers**

A newly created policy optimized for query performance removed the `Authorization` header from key considerations, treating requests from different users as identical.

---

### What We Changed

**Fix 1: Isolation of Dynamic Assets**

We updated the CDN behavior configuration to block caching on dynamic paths entirely, using policy templates:

```json
"CacheBehavior": {
  "PathPattern": "/dashboard/*",
  "MinTTL": 0,
  "MaxTTL": 0,
  "DefaultTTL": 0,
  "ForwardedValues": {
    "QueryString": true,
    "Headers": ["Authorization", "Host"]
  }
}
```

**Fix 2: Automated CDN Testing**

We implemented pre-deployment checks to verify cache headers (`Cache-Control: private, no-store`) on user endpoints.
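A pre-deployment check of the kind described in Fix 2 might look roughly like the sketch below. It is not the team's actual tooling: the function names are hypothetical, and it assumes that a user endpoint passes only when its `Cache-Control` header carries both `private` and `no-store`.

```python
import urllib.request
from typing import Optional, Set

# Assumption: both directives must be present on every user endpoint.
REQUIRED_DIRECTIVES = {"private", "no-store"}


def parse_cache_control(value: str) -> Set[str]:
    """Return the lowercase directive names in a Cache-Control header value."""
    directives = set()
    for part in value.split(","):
        # Drop any "=value" suffix, e.g. "max-age=0" becomes "max-age".
        name = part.strip().split("=", 1)[0].strip().lower()
        if name:
            directives.add(name)
    return directives


def is_safe_for_user_endpoint(value: Optional[str]) -> bool:
    """True only if the header exists and includes every required directive."""
    if value is None:
        return False
    return REQUIRED_DIRECTIVES <= parse_cache_control(value)


def check_endpoint(url: str, timeout: float = 5.0) -> bool:
    """Send a HEAD request and validate the response's Cache-Control header.

    Error statuses (for example 401 on an unauthenticated request) raise
    urllib.error.HTTPError, which is left to the caller to handle.
    """
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return is_safe_for_user_endpoint(resp.headers.get("Cache-Control"))
```

A CI step could call `check_endpoint` for each user-facing route in a staging environment and fail the deployment if any call returns `False`. Keeping the header parsing separate from the network call lets the rule itself be unit-tested without a live endpoint.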

SRE Takeaway

Modern platform reliability depends on proactive bounds configuration. Insufficient CPU/memory parameters, missing timeout thresholds, or lack of auto-healing definitions are the root triggers for over 80% of cluster outages.
