Experiment Designer
Design, prioritize, and evaluate product experiments with clear hypotheses and defensible decisions.
When To Use
Use this skill for:
- A/B and multivariate experiment planning
- Hypothesis writing and success criteria definition
- Sample size and minimum detectable effect planning
- Experiment prioritization with ICE scoring
- Reading statistical output for product decisions
Core Workflow
- Write the hypothesis in If/Then/Because format
  - If we change [intervention]
  - Then [metric] will change by [expected direction/magnitude]
  - Because [behavioral mechanism]
- Define metrics before running the test
  - Primary metric: single decision metric
  - Guardrail metrics: quality/risk protection
  - Secondary metrics: diagnostics only
- Estimate sample size
  - Baseline conversion or baseline mean
  - Minimum detectable effect (MDE)
  - Significance level (alpha) and power
  Use: scripts/sample_size_calculator.py (see Tooling)
- Prioritize experiments with ICE
  - Impact: potential upside
  - Confidence: evidence quality
  - Ease: cost/speed/complexity
  ICE Score = (Impact * Confidence * Ease) / 10
- Launch with stopping rules
  - Decide fixed sample size or fixed duration in advance
  - Avoid repeated peeking unless the analysis method is designed for it
  - Monitor guardrails continuously
- Interpret results
  - Statistical significance is not business significance
  - Compare point estimate + confidence interval to decision threshold
  - Investigate novelty effects and segment heterogeneity
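The ICE formula from the prioritization step can be sketched in a few lines. This is a minimal illustration, assuming each factor is rated on a 1-10 scale (the scale is an assumption; the workflow does not state one). The function name `ice_score` and the backlog entries are hypothetical.

```python
def ice_score(impact: float, confidence: float, ease: float) -> float:
    """ICE Score = (Impact * Confidence * Ease) / 10, as defined in the workflow."""
    for name, value in (("impact", impact), ("confidence", confidence), ("ease", ease)):
        if not 1 <= value <= 10:  # assumed 1-10 rating scale
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return impact * confidence * ease / 10


# Hypothetical backlog entries: (name, impact, confidence, ease)
backlog = [
    ("Shorter signup form", 8, 6, 7),
    ("New pricing page layout", 9, 4, 3),
    ("Checkout button copy", 4, 7, 10),
]

# Highest score first
ranked = sorted(backlog, key=lambda item: ice_score(*item[1:]), reverse=True)
for name, impact, confidence, ease in ranked:
    print(f"{ice_score(impact, confidence, ease):6.1f}  {name}")
```

Because the three factors are multiplied, a low rating on any one of them pulls the whole score down, which is why the high-impact but hard and uncertain pricing change ranks last here.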
Hypothesis Quality Checklist
- Contains explicit intervention and audience
- Specifies measurable metric change
- States plausible causal reason
- Includes expected minimum effect
- Defines failure condition
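One way to make the checklist hard to skip is to capture a hypothesis as structured data and flag blank fields. The sketch below is only an illustration: the `Hypothesis` class, its field names, and the example values are all hypothetical and not part of this skill's scripts.

```python
from dataclasses import dataclass, fields


@dataclass
class Hypothesis:
    intervention: str       # what we change
    audience: str           # who is exposed to the change
    metric: str             # measurable metric expected to move
    expected_change: str    # direction and expected minimum effect
    mechanism: str          # plausible causal reason
    failure_condition: str  # outcome that would refute the hypothesis

    def missing_items(self) -> list[str]:
        """Return the names of checklist fields that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Format the hypothesis in If/Then/Because form."""
        return (
            f"If we change {self.intervention} for {self.audience}, "
            f"then {self.metric} will change by {self.expected_change}, "
            f"because {self.mechanism}."
        )


# Hypothetical example with the failure condition deliberately left blank
h = Hypothesis(
    intervention="the signup form from five fields to three",
    audience="new mobile visitors",
    metric="signup completion rate",
    expected_change="at least +2 percentage points",
    mechanism="fewer fields reduce effort at the point of commitment",
    failure_condition="",
)
print(h.render())
print("Missing checklist items:", h.missing_items())
```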
Common Experiment Pitfalls
- Underpowered tests leading to false negatives
- Running too many simultaneous changes without isolation
- Changing targeting or implementation mid-test
- Stopping early on random spikes
- Ignoring sample ratio mismatch and instrumentation drift
- Declaring success from p-value without effect-size context
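Sample ratio mismatch (SRM) from the list above can be checked with a chi-square goodness-of-fit test on the assignment counts. The following is a minimal sketch for a two-arm test using only the standard library; the function name `srm_p_value` and the 0.001 alert threshold are assumptions (a strict threshold is a common convention, not something this document prescribes).

```python
import math


def srm_p_value(control_n: int, treatment_n: int, expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (treatment_n - expected_treatment) ** 2 / expected_treatment
    )
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


SRM_ALERT_THRESHOLD = 0.001  # assumed convention; tune to your own alerting policy

p = srm_p_value(control_n=50_000, treatment_n=48_500)
if p < SRM_ALERT_THRESHOLD:
    print(f"Possible SRM (p = {p:.2e}): check assignment and logging before reading results")
else:
    print(f"No SRM detected (p = {p:.3f})")
```

A very small p-value here says the observed split is unlikely under the intended allocation, which usually points to an assignment or instrumentation bug rather than a real treatment effect.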
Statistical Interpretation Guardrails
- A p-value below alpha indicates evidence against the null hypothesis, not proof that the effect is real.
- A confidence interval that crosses zero (no effect) means the direction of the effect is uncertain.
- Wide intervals imply low precision, even when the result is statistically significant.
- Use practical significance thresholds tied to business impact.
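The guardrails above can be illustrated by computing a confidence interval for the difference in conversion rates and comparing it with a practical threshold. This is a rough sketch, assuming a simple Wald (normal approximation) interval for two independent proportions; the function names, the result labels, and the example counts are hypothetical.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Wald interval for the absolute difference in conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, diff - z * se, diff + z * se


def interpret(lower, upper, practical_threshold):
    """Map an interval to an illustrative label; the labels are not a standard."""
    if lower > practical_threshold:
        return "effect exceeds practical threshold"
    if upper < 0:
        return "negative effect"
    if lower > 0:
        return "positive, but practical significance unclear"
    return "direction uncertain"


# Hypothetical counts: 10.0% vs 11.0% conversion, 10,000 users per arm
diff, lower, upper = diff_confidence_interval(1_000, 10_000, 1_100, 10_000)
print(f"diff = {diff:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
# Assumed business threshold: a lift below 0.5 percentage points is not worth shipping
print(interpret(lower, upper, practical_threshold=0.005))
```

In this example the interval excludes zero, so the result is statistically significant, yet its lower bound sits below the assumed business threshold, so the data alone do not show the lift is large enough to matter.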
See:
- references/experiment-playbook.md
- references/statistics-reference.md
Tooling
scripts/sample_size_calculator.py
Computes required sample size (per variant and total) from:
- baseline rate
- MDE (absolute or relative)
- significance level (alpha)
- statistical power
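The script's internals are not shown here. As a rough sketch of the kind of calculation such a tool typically performs, the snippet below assumes a two-sided two-proportion z-test with the normal approximation and unpooled variance. The function name, parameter names, and defaults are hypothetical and may not match the script's actual interface.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, mde, relative=False, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect the given MDE.

    Assumes a two-sided test on two independent proportions (normal approximation).
    """
    p1 = baseline_rate
    # A relative MDE scales the baseline; an absolute MDE is added to it.
    p2 = p1 * (1 + mde) if relative else p1 + mde
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("rates must lie strictly between 0 and 1 and must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# Hypothetical inputs: 10% baseline, +2 percentage points absolute MDE
per_variant = sample_size_per_variant(baseline_rate=0.10, mde=0.02)
print(f"per variant: {per_variant}, total for two variants: {2 * per_variant}")
```

Required sample size grows with the inverse square of the MDE, so halving the detectable effect roughly quadruples the traffic needed.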
Example: