Pre-Experiment Bias

How Statsig Warehouse Native detects pre-experiment bias caused by uneven user distributions between treatment and control groups before exposure.

In some cases, users in two experiment groups have meaningfully different average behaviors before any intervention is applied. If this difference persists after the experiment starts, the analysis may attribute that pre-existing difference to the intervention, making a result appear more or less impactful than it is. CUPED helps address this bias, but can't fully account for it.

Some metrics, such as retention, aren't viable candidates for CUPED and can't be easily adjusted.
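For metrics where CUPED does apply, the following is a minimal sketch of the general adjustment, not Statsig's implementation. The function name `cuped_adjust` and the sample values are hypothetical. It pools both groups, estimates the regression coefficient of the experiment-period value on the pre-experiment value, and subtracts the part of each user's value explained by their pre-experiment behavior.

```python
from statistics import mean

def cuped_adjust(pre, post):
    """Return CUPED-adjusted experiment-period values (textbook form)."""
    pre_mean, post_mean = mean(pre), mean(post)
    # theta = cov(pre, post) / var(pre); the shared 1/(n-1) factor cancels.
    cov = sum((x - pre_mean) * (y - post_mean) for x, y in zip(pre, post))
    var = sum((x - pre_mean) ** 2 for x in pre)
    theta = cov / var
    return [y - theta * (x - pre_mean) for x, y in zip(pre, post)]

# Hypothetical data: the test group starts higher and there is no real effect.
control_pre, control_post = [10, 12, 11, 9], [11, 12, 10, 10]
test_pre, test_post = [13, 15, 14, 12], [13, 16, 14, 13]

adjusted = cuped_adjust(control_pre + test_pre, control_post + test_post)
n = len(control_pre)
raw_delta = mean(test_post) - mean(control_post)
adjusted_delta = mean(adjusted[n:]) - mean(adjusted[:n])
print(raw_delta, adjusted_delta)
```

In this made-up data the adjusted delta is much smaller than the raw delta but not zero, which matches the point above: CUPED reduces the pre-existing gap without necessarily removing it.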

Statsig measures the pre-experiment values of all scorecard metrics for all experiment groups and determines whether the values are significantly different and could cause misinterpretations. If Statsig detects bias, it notifies users and places a warning on relevant Pulse results.

How it works

Statsig provides a "Days Since Exposure" view to help identify novelty effects and pre-experiment effects. For example, the test group in the following experiment had a consistently higher mean than the control group in the week before the experiment started:

Figure: Pre-experiment bias visualization showing the test group with a consistently higher mean than the control group.

Statsig detects this bias by running the standard Pulse calculation on the pre-experiment period (looking back one week in Cloud, or over the configured CUPED lookback window in Warehouse Native), then calculating the p-value for the null hypothesis that the groups are identical. Because a large number of scorecard metrics or groups increases the chance of false positives, Statsig flags relevant results using logic that balances awareness against that risk.
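The check described above can be sketched as follows. This is an illustration under stated assumptions, not Statsig's actual logic: the function names are hypothetical, the p-value uses a normal approximation to a two-sample test of means, and the Bonferroni threshold stands in for the unspecified flagging rule.

```python
import math
from statistics import NormalDist, mean, variance

def pre_experiment_p_value(control, test):
    """Two-sided p-value for equal pre-experiment means (normal approximation)."""
    # Standard error of the difference in means, using unpooled variances.
    se = math.sqrt(variance(control) / len(control) + variance(test) / len(test))
    z = (mean(test) - mean(control)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def flag_bias(p_values, alpha=0.05):
    """Assumed flagging rule: Bonferroni correction across all comparisons."""
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}

# Hypothetical pre-experiment values per user for two scorecard metrics.
p_values = {
    "sessions": pre_experiment_p_value([10, 12, 11, 13, 9, 10, 12, 11],
                                       [14, 15, 13, 16, 14, 15, 13, 16]),
    "purchases": pre_experiment_p_value([1, 2, 3, 2, 1, 3], [2, 1, 3, 3, 1, 2]),
}
print(flag_bias(p_values))
```

Dividing the significance level by the number of comparisons is one simple way to keep the overall false-positive rate down as the scorecard grows; it is deliberately conservative.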

What to do

Pre-experiment bias can occur by chance and isn't always a major issue.

  • If the total delta is small, it may not meaningfully influence the interpretation of results.
  • If CUPED can account for the bias, the bias shouldn't affect results.

In many cases, treat this warning as informational and proceed, applying extra scrutiny to impacted metrics. This is appropriate when the metric isn't critical to the experiment or when directional movement matters more than the exact value. Additional experiment time may also reduce the bias if no systemic source exists, because new users dilute the imbalance.
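The dilution effect can be illustrated with a simple weighted average. This sketch assumes equal group sizes, an initial cohort with a known pre-experiment delta, and later users who arrive with no imbalance on average; the function name and numbers are hypothetical.

```python
def diluted_delta(initial_delta, initial_users, new_users):
    """Expected pre-experiment delta after balanced new users join each group."""
    # Group means are cohort-size-weighted averages, so the initial delta
    # is scaled by the initial cohort's share of all users.
    return initial_delta * initial_users / (initial_users + new_users)

# An initial delta of 3.0 across 1,000 users per group, then 2,000 more join each.
print(diluted_delta(3.0, 1000, 2000))
```

If the imbalance has a systemic source, new users carry it too and this scaling doesn't apply.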

If the metric is critical and the exact numerical value matters, consider resalting and restarting the experiment.
