Best Practices and Avoiding False Positives

Best practices for interpreting Statsig Warehouse Native experiment results, including reading lift, avoiding bias, and trusting statistical significance.

Interpreting Pulse results soundly means forming a hypothesis before you look, focusing on a small set of key metrics, and staying skeptical of borderline statistical significance. Follow these practices to avoid false positives and bias:

Form a hypothesis before viewing Pulse. Identify which metrics you expect to shift, what else could have happened, and what signals indicate something went wrong.
Establish a small set of key metrics directly related to your hypothesis. More than a handful of key metrics usually indicates an ill-defined hypothesis or unfocused experimentation. Each additional metric you examine is another chance for a false positive, so the probability that at least one metric appears significant by chance grows with the number of metrics.
Avoid cherry-picking results. Don't highlight metrics that look good while ignoring those that don't, and don't cite numbers that have no connection to your hypothesis. Every statistically significant result should have a plausible explanation; keep in mind that a false positive is itself one plausible explanation.
Multiple independent effects that are consistent with a plausible explanation increase confidence that the observed effects are real, even with borderline p-values.
Expect false positives and be cautious about statistically significant results with borderline p-values. At a 5% significance level (95% confidence intervals), expect roughly one in twenty metrics with no true effect to appear statistically significant by random chance alone. The false positive rate rises further if you also treat borderline results (for example, p = 0.06) as significant.
Look beyond your hypothesis for additional effects, tradeoffs, and unexpected behaviors. These can reveal information about how users interact with your product and often lead to follow-up experiments.
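
The effect of examining many metrics can be made concrete with a little arithmetic. The sketch below is illustrative only and is not part of Statsig: the function names are hypothetical, and it assumes every metric is independent and has no true effect, which real product metrics rarely satisfy (correlated metrics tend to produce fewer independent chances for a false positive). Under those assumptions, the chance of at least one false positive among k metrics is 1 - (1 - alpha)^k.

```python
def familywise_false_positive_rate(num_metrics: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_metrics` metrics looks significant
    purely by chance, assuming independent metrics with no true effect."""
    if num_metrics < 0:
        raise ValueError("num_metrics must be non-negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return 1.0 - (1.0 - alpha) ** num_metrics


def expected_false_positives(num_metrics: int, alpha: float = 0.05) -> float:
    """Average number of metrics that look significant by chance alone."""
    return num_metrics * alpha


if __name__ == "__main__":
    # Compare a focused scorecard with progressively broader ones.
    for k in (1, 5, 20):
        rate = familywise_false_positive_rate(k)
        count = expected_false_positives(k)
        print(f"{k:>2} metrics: P(at least one false positive) = {rate:.1%}, "
              f"expected false positives = {count:.2f}")
```

With alpha = 0.05, the chance of at least one false positive is 5% for a single metric, about 23% for five metrics, and about 64% for twenty. Passing alpha=0.06 shows how treating borderline results as significant raises these numbers further.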
