Critical - P1 CI/CD

A pipeline sync bug that cleared our frontend S3 bucket

Outage Synopsis:

A misconfigured environment variable on a CI build runner pointed `aws s3 sync --delete` at the production bucket instead of staging, deleting almost all static web assets within seconds.
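The failure mode in the synopsis can be previewed without touching the bucket, because `aws s3 sync` accepts a `--dryrun` flag that only prints the planned operations. The sketch below is one possible guard, not the team's actual script: it counts the `(dryrun) delete:` lines and refuses to sync when the count exceeds a limit. The function names and the `MAX_DELETES` variable are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: function names and MAX_DELETES are illustrative assumptions.

# Count the "(dryrun) delete:" lines in dry-run output read from stdin.
count_planned_deletes() {
  grep -c '^(dryrun) delete:' || true   # grep exits 1 on zero matches; still prints 0
}

# Succeed only when the planned delete count is within the allowed limit.
deletes_within_limit() {
  local planned="$1" limit="$2"
  [ "$planned" -le "$limit" ]
}

# Run a dry run first; sync for real only if the delete count looks sane.
guarded_sync() {
  local src="$1" bucket="$2" limit="${MAX_DELETES:-50}"
  local planned
  planned="$(aws s3 sync "$src" "s3://$bucket" --delete --dryrun | count_planned_deletes)"
  if ! deletes_within_limit "$planned" "$limit"; then
    echo "ABORT: sync would delete $planned objects (limit $limit)" >&2
    return 1
  fi
  aws s3 sync "$src" "s3://$bucket" --delete
}

# Example call (not executed here): guarded_sync ./dist "$BUCKET_NAME"
```

A cap like this does not replace environment checks, but it turns a full bucket wipe into a failed job.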

## The Incident

**Duration:** 8 minutes of complete outage.

**Impact:** The public landing page and checkout portal displayed blank screens or raw XML `AccessDenied` errors for all visitors.

**Root Cause:** Missing staging/production namespace validation in the CI script, combined with an active `--delete` flag.

---

### Timeline

**11:04** - A developer merges a patch to the staging branch, triggering the staging deployment pipeline.

**11:05** - The staging deployment job starts. Due to a copy-paste error, the environment variable `BUCKET_NAME` had been hardcoded to the production bucket identifier (`devopsmanual-prod-assets`).

**11:06** - The deploy runner executes:

```bash
aws s3 sync ./dist s3://$BUCKET_NAME --delete
```

Because the build directory `./dist` contained only staging assets, the `--delete` flag told the AWS CLI to delete every object in the production bucket that had no counterpart in the build output. Within 15 seconds, 95% of production static files were purged.

**11:07** - Automated monitors alert on 404 errors at the CDN layer. Engineers assemble on Zoom.

**11:09** - The lead SRE identifies the bucket wipe and uses S3 bucket versioning to start an automated rollback script.

**11:13** - The rollback script completes, restoring the deleted assets. The site is healthy.

---

### What Went Wrong

**1. Lack of Environment Gates**

The deploy script did not verify the target bucket suffix before executing destructive actions, which allowed a staging job to target production S3 assets.

**2. Destructive Delete Flag**

Using `aws s3 sync --delete` in continuous integration builds is highly risky without bucket locks or confirmation checks.

---

### What We Changed

**Fix 1: CI Environment Separation**

We separated deployment permissions using IAM roles assumed through OIDC authentication. The staging runner's role no longer has permission to write to the production bucket.
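As a complement to the IAM separation in Fix 1, a deploy script can confirm which role it is running as before doing anything destructive. The sketch below assumes a hypothetical role naming convention such as `ci-staging-deploy` and `ci-production-deploy`. The function names are illustrative, while `aws sts get-caller-identity` is the standard CLI call for reading the caller's ARN.

```bash
#!/usr/bin/env bash
# Sketch only: assumes role names embed the environment, e.g. ci-staging-deploy.

# Return 0 only when the caller's role ARN carries the expected environment marker.
role_matches_env() {
  local env="$1" arn="$2"
  case "$arn" in
    *"-${env}-"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Look up the current identity and fail fast if it belongs to another environment.
preflight_identity() {
  local env="$1" arn
  arn="$(aws sts get-caller-identity --query Arn --output text)" || return 1
  if ! role_matches_env "$env" "$arn"; then
    echo "CRITICAL: role $arn does not match environment $env" >&2
    return 1
  fi
}

# Example call (not executed here): preflight_identity "$ENV" || exit 1
```

The IAM policy remains the real control. This check only makes a misrouted job fail early, with a readable message.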
**Fix 2: Stricter Build Checks**

We added environment checks inside the deployment scripts:

```bash
# Abort if a non-production job is pointed at a production bucket.
if [[ "$ENV" != "production" && "$BUCKET_NAME" == *"-prod-"* ]]; then
  echo "CRITICAL: Staging job attempted to write to production bucket!" >&2
  exit 1
fi
```

**Fix 3: MFA Delete & Object Versioning**

Object versioning is enabled on the S3 bucket, and permanent object deletions now require MFA delete.
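Versioning is what made the 11:09 rollback possible: on a versioned bucket, `sync --delete` leaves a delete marker on top of each object rather than destroying it. The report does not include the rollback script itself, so the following is only a minimal sketch of one common approach, with hypothetical function names. It lists the current delete markers through `aws s3api list-object-versions` and removes each one so that the previous version becomes current again.

```bash
#!/usr/bin/env bash
# Sketch only: not the team's rollback script. Keys containing tabs are not handled.

# Print current delete markers as "key<TAB>version-id" lines.
list_delete_markers() {
  local bucket="$1"
  aws s3api list-object-versions --bucket "$bucket" \
    --query 'DeleteMarkers[?IsLatest==`true`].[Key,VersionId]' --output text
}

# Remove each delete marker read from stdin, which un-hides the prior version.
remove_delete_markers() {
  local bucket="$1" key version
  while IFS="$(printf '\t')" read -r key version; do
    # Skip blank lines and the literal "None" printed when nothing matches.
    [ -n "$key" ] && [ "$key" != "None" ] || continue
    aws s3api delete-object --bucket "$bucket" --key "$key" --version-id "$version"
  done
}

# Example call (not executed here):
#   list_delete_markers "$BUCKET_NAME" | remove_delete_markers "$BUCKET_NAME"
```

On a bucket with MFA delete enabled, removing a specific version, delete markers included, is expected to require the `--mfa` argument to `delete-object` as well, so a rollback like this would likely need an operator with an MFA device rather than an unattended job.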

SRE Takeaway

Modern platform reliability depends on proactive bounds configuration. Insufficient CPU/memory parameters, missing timeout thresholds, or lack of auto-healing definitions are the root triggers for over 80% of cluster outages.
