Chaos Engineering¶
Design experiments that surface real weaknesses in production systems — without becoming outages. Most "chaos engineering" attempts skip steady-state measurement, define no abort criteria, and have no blast-radius bound. This skill enforces the discipline that makes chaos experiments safe and useful.
When to use¶
- Planning a chaos experiment (what to break, where, when, how to abort)
- Calculating blast radius before running the experiment
- Reviewing an existing experiment plan for safety
- Choosing a chaos tool (Chaos Toolkit / Chaos Mesh / Litmus / Gremlin / AWS FIS)
- Writing a chaos experiment postmortem
- Running a Game Day exercise
When NOT to use¶
- General incident response (use
incident-response) - Threat hunting / red-team (use
red-team,threat-detection) - Performance load testing (different goal — chaos is about failure modes, not capacity)
- Production debugging (chaos discovers weaknesses preemptively, not after-the-fact)
Core principle: chaos without abort criteria is an outage¶
The 4 Principles of Chaos Engineering (Netflix, 2016):
- Build a hypothesis around steady-state behavior. Not "what breaks?" but "X holds; will it still hold under fault Y?"
- Vary real-world events. Inject realistic failures: kill nodes, slow networks, lose cache, throttle dependencies.
- Run experiments in production. Staging never has the same failure modes. Start small.
- Automate experiments to run continuously. One-off chaos is a press release; continuous chaos is engineering.
Add a fifth: Define abort criteria up front. A chaos experiment with no abort criteria is an outage by another name.
Quick start¶
SKILL=engineering/chaos-engineering/skills/chaos-engineering
# 1. Design an experiment
python "$SKILL/scripts/experiment_designer.py" --target "checkout-svc" --hypothesis "p99 latency stays <500ms" --attack latency --duration-min 15
# 2. Calculate blast radius
python "$SKILL/scripts/blast_radius_calculator.py" --traffic-share 0.05 --user-pop 1000000 --duration-min 15
# 3. Generate postmortem after the experiment
python "$SKILL/scripts/experiment_postmortem.py" --plan experiment.json --result-log results.txt
The 3 Python tools¶
All stdlib-only. Run with --help.
experiment_designer.py¶
Generates a structured experiment plan from inputs. Enforces the required sections (hypothesis, steady-state metric, blast radius, abort criteria, rollback).
python scripts/experiment_designer.py \
--target "checkout-svc" \
--hypothesis "p99 latency stays <500ms when payment-svc is slow" \
--attack latency \
--magnitude "+200ms" \
--duration-min 15 \
--blast-radius "5% of US traffic" \
--abort-if "p99 > 1000ms OR error_rate > baseline + 1pp"
Outputs a markdown plan with: hypothesis, steady-state, attack, magnitude, duration, blast radius, abort criteria, rollback procedure, monitoring dashboards, and learning question.
blast_radius_calculator.py¶
Computes the blast radius of a planned experiment. Given traffic share + user population + duration, calculates expected affected users, expected error budget burn, and a risk score.
python scripts/blast_radius_calculator.py \
--traffic-share 0.05 \
--user-pop 1000000 \
--duration-min 15 \
--baseline-availability 0.999 \
--expected-impact-availability 0.95
Outputs: - Expected affected users - Error budget consumed (in minutes of error budget) - Risk score: GREEN / YELLOW / RED - Recommendation: PROCEED / REDUCE / ABORT
GREEN = <1% error budget; YELLOW = 1-10%; RED = >10%.
experiment_postmortem.py¶
Produces a structured postmortem from an experiment plan + results. Catches the common postmortem failure modes: no learning recorded, no follow-up actions, blame-laden language.
Outputs markdown with: summary, hypothesis (was it confirmed/refuted?), what we learned, what surprised us, follow-up actions with owners, and link to next experiment.
The 7 attack types (taxonomy)¶
Different attacks reveal different weaknesses. See references/attack_taxonomy.md for full detail.
| Attack | What it tests | Tooling |
|---|---|---|
| Latency | Timeouts, retries, circuit breakers | tc, Chaos Mesh NetworkChaos |
| Error | Error handling, fallback paths | Chaos Mesh HTTPChaos, Toxiproxy |
| Resource (CPU, memory, disk) | Saturation handling, autoscaling | Chaos Mesh StressChaos, stress-ng |
| Network partition | Split-brain, consensus, failover | Chaos Mesh NetworkChaos partition |
| Dependency failure | Graceful degradation, fallback | Service mesh fault injection |
| Time | Clock skew, NTP issues | libfaketime, Chaos Mesh TimeChaos |
| Infrastructure (kill instance) | Auto-recovery, failover | AWS FIS, Chaos Monkey |
Pick the attack that matches the hypothesis. "What happens if X is slow?" → latency. "What happens if X loses network?" → partition.
Tooling chooser¶
| Tool | Best for | Pricing | Stack |
|---|---|---|---|
| Chaos Toolkit | Lightweight, language-agnostic, JSON experiments | OSS | Any |
| Chaos Mesh | Kubernetes-native, rich CRDs, in-cluster | OSS | Kubernetes |
| Litmus | Kubernetes, Argo-integrated, large library | OSS + Enterprise | Kubernetes |
| Gremlin | Enterprise SaaS, multi-cloud, audit | Paid | Any |
| AWS FIS | AWS-native, IAM-integrated, EC2/ECS/EKS | Paid (AWS) | AWS |
| Custom | Niche needs, single-cloud, low budget | None | Any |
Decision rules: - k8s-only stack + OSS → Chaos Mesh or Litmus (Litmus has bigger experiment library) - Multi-cloud + OSS → Chaos Toolkit - AWS-heavy + simple needs → AWS FIS - Enterprise + audit/compliance → Gremlin
See references/tooling_landscape.md for trade-offs.
Workflows¶
Workflow 1: Design and run a single experiment¶
1. State a hypothesis: "When [fault], steady-state metric X stays within Y."
2. Identify the steady-state metric — must be measurable BEFORE the experiment.
3. Run blast_radius_calculator.py — confirm GREEN before proceeding.
4. Run experiment_designer.py to produce the plan.
5. Get a peer review of the plan; confirm abort criteria are concrete.
6. Notify the on-call team in #incidents (or whatever channel).
7. Run the experiment with monitoring open.
8. If abort criteria are hit, abort immediately; record what happened.
9. Run experiment_postmortem.py to capture learnings.
10. File follow-up actions; link to next experiment.
Workflow 2: Game Day exercise¶
1. Pick a scenario (e.g., "primary database fails over").
2. Identify all dependent services that should keep working.
3. Build a multi-experiment plan covering each layer.
4. Schedule with stakeholders; on-call coverage required.
5. Run with a facilitator who manages the scenario.
6. Capture observations in a shared doc as they happen.
7. Single combined postmortem covering all observations.
8. Track follow-up actions in a board with owners.
Workflow 3: Continuous chaos (game days → daily)¶
1. Start: weekly Game Day in staging.
2. Move to: weekly Game Day in production with limited blast radius.
3. Mature to: continuous chaos via scheduled experiments (Litmus chaos schedule, Gremlin scenarios).
4. Wire to deployment: every prod deploy triggers a baseline chaos sweep.
5. Track: experiments per week, weaknesses discovered, MTTR trend.
Composition with other skills¶
This skill explicitly composes with two others in this library:
| Skill | Composition |
|---|---|
feature-flags-architect |
Kill switches defined there are the abort triggers here |
kubernetes-operator |
Operators are common chaos targets (test reconcile under fault) |
incident-response |
Chaos experiments that escalate become incidents |
Anti-patterns¶
- No hypothesis — "let's break things" is sabotage, not engineering
- No steady-state metric — without a baseline, you can't tell if X broke
- No blast radius bound — full-prod experiment without limits = outage
- No abort criteria — see above; this is mandatory
- No on-call coverage — chaos without monitoring is unmonitored production
- Chaos in staging only — staging never has prod failure modes
- Chaos in dev — useless; dev has different failure modes from prod
- One-off chaos — single experiment is a press release; learning requires recurrence
- Blame-laden postmortem — record causes, not blame; teams stop running chaos otherwise
References¶
references/chaos_principles.md— the 4 principles, history, when to startreferences/experiment_design.md— hypothesis structure, steady-state metrics, abort criteriareferences/attack_taxonomy.md— 7 attack types with examples and toolingreferences/tooling_landscape.md— Chaos Toolkit / Mesh / Litmus / Gremlin / FIS / DIY
Slash command¶
/chaos-experiment — interactive experiment design wizard that runs all 3 tools.
Asset templates¶
assets/experiment_template.md— fill-in plan templateassets/postmortem_template.md— structured postmortem template
Verifiable success¶
A team using this skill should achieve:
- 100% of chaos experiments have a written hypothesis, abort criteria, and blast-radius calculation
- Blast radius for any single experiment never exceeds 10% of error budget
- Mean time between chaos experiments <14 days (continuous, not one-off)
- Each experiment produces ≥1 follow-up action that gets shipped
- No chaos experiment escalates to a customer-impacting incident in trailing 90 days