Best Practices and Avoiding False Positives
Best practices for interpreting Statsig experiment results, including avoiding common biases, reading lift correctly, and trusting statistical significance.
The following suggestions help you interpret Pulse in a scientifically sound way:
- Have a hypothesis in mind before viewing Pulse. What metrics do you expect to shift due to the change you made? What else could have happened? What are signs that something went wrong?
- Establish a small set of key metrics, directly related to your hypothesis, that would show the experiment worked. More than a handful of key metrics usually indicates an ill-defined hypothesis or unfocused experimentation. Examining too many metrics increases the false positive rate (seeing an effect where only statistical noise exists).
- Avoid cherry-picking results. Don't selectively pick three metrics that look good while ignoring two that don't. Also avoid picking "good" or "bad" numbers with no connection to your hypothesis. Context matters: statistically significant results should have a plausible explanation, and a false positive is itself one plausible explanation.
- Multiple independent effects that are consistent with a plausible story lend credibility to the observed effects, even with borderline p-values.
- Expect to see false positives, and be suspicious of statistically significant results with borderline p-values. For example, at a 5% significance level (95% confidence intervals), about one in every twenty metrics with no true effect is expected to appear statistically significant purely due to random chance. This number increases when you also count borderline metrics (for example, p = 0.06).
- Look beyond your hypothesis. What other effects can you find? Are there tradeoffs? Are there unexpected behaviors? These can reveal information about your users and how they interact with your product, and are often the source of follow-up experiments and new ideas.
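
The "one out of twenty" arithmetic above can be sketched in a few lines of Python. This is an illustrative calculation, not part of Statsig or Pulse: the function names are hypothetical, and it assumes every metric is independent and has no true effect, which real metrics rarely satisfy exactly.

```python
def expected_false_positives(num_metrics: int, alpha: float = 0.05) -> float:
    """Expected count of metrics flagged as significant by chance alone.

    Assumes no metric has a true effect (illustrative assumption).
    """
    return num_metrics * alpha


def prob_at_least_one_false_positive(num_metrics: int, alpha: float = 0.05) -> float:
    """Chance that at least one metric is flagged by chance alone.

    Assumes metrics are independent and have no true effect
    (illustrative assumption; correlated metrics behave differently).
    """
    return 1.0 - (1.0 - alpha) ** num_metrics


if __name__ == "__main__":
    for m in (1, 5, 20):
        print(
            f"{m:>2} metrics: "
            f"expected false positives = {expected_false_positives(m):.2f}, "
            f"P(at least one) = {prob_at_least_one_false_positive(m):.1%}"
        )
```

Under these assumptions, twenty metrics yield one expected false positive and roughly a 64% chance of seeing at least one, which is why a small, pre-selected set of key metrics is easier to interpret.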