Best Practices and Avoiding False Positives

Best practices for interpreting Statsig Warehouse Native experiment results, including reading lift, avoiding bias, and trusting statistical significance.

Follow these suggestions to interpret Pulse results in a scientifically sound way:

  1. Form a hypothesis before viewing Pulse. Identify which metrics you expect to shift, what alternative outcomes are possible, and which signals would indicate that something went wrong.
  2. Establish a small set of key metrics directly related to your hypothesis. More than a handful of key metrics usually indicates an ill-defined hypothesis or unfocused experimentation. The more metrics you examine, the more likely it is that at least one of them appears statistically significant by chance alone.
  3. Avoid cherry-picking results. Don't selectively pick metrics that look good while ignoring those that don't. Avoid using numbers with no connection to your hypothesis. Statistically significant results should have a plausible explanation, and keep in mind that a false positive is itself one plausible explanation.
  4. Multiple independent effects that are consistent with a plausible explanation increase confidence that the observed effects are real, even with borderline p-values.
  5. Expect false positives and be cautious about statistically significant results with borderline p-values. At a 95% confidence level (5% significance level), roughly one in twenty metrics with no true effect is expected to appear statistically significant by random chance alone. The false positive rate rises further if you also treat borderline results (for example, p = 0.06) as significant.
  6. Look beyond your hypothesis for additional effects, tradeoffs, and unexpected behaviors. These can reveal information about how users interact with your product and often lead to follow-up experiments.
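
The arithmetic behind points 2 and 5 can be sketched in a few lines of Python. This is a minimal illustration, not part of Statsig or Pulse: the function names are hypothetical, and the calculation assumes that the metrics are statistically independent and that the treatment has no true effect on any of them. Real metrics are often correlated, so treat the numbers as a rough guide only.

```python
def expected_false_positives(num_metrics: int, alpha: float = 0.05) -> float:
    """Expected count of metrics that look significant purely by chance.

    Assumes no metric has a true effect (hypothetical helper).
    """
    return num_metrics * alpha


def prob_at_least_one_false_positive(num_metrics: int, alpha: float = 0.05) -> float:
    """Chance that at least one of num_metrics no-effect metrics looks significant.

    Assumes the metrics are independent, which is a simplification.
    """
    return 1.0 - (1.0 - alpha) ** num_metrics


if __name__ == "__main__":
    # Compare a focused experiment with increasingly broad metric lists.
    for m in (1, 5, 20, 100):
        expected = expected_false_positives(m)
        at_least_one = prob_at_least_one_false_positive(m)
        print(f"{m:>3} metrics: expected false positives = {expected:.2f}, "
              f"P(at least one) = {at_least_one:.1%}")
```

Under these assumptions, twenty no-effect metrics yield one expected false positive and roughly a 64% chance of seeing at least one, which is why a short list of key metrics chosen in advance is easier to interpret.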
