Read Results

Read and interpret Statsig Warehouse Native experiment results, including scorecards, primary metrics, lift, intervals, and significance indicators.

Read experiment results

The Results tab shows your experiment's hypothesis, an Exposures chart, and a Scorecard of metric lifts. Open it to read and interpret how each variant performed.

Exposures

At the top of the Results page is the Exposures Chart. Exposures are the unique experimental units enrolled in the experiment. This is typically the number of unique users; for device-level experimentation, this is the number of devices. The timeline shows when the experiment started and how many exposures Statsig enrolled on each day. Use the chart to see the rate at which Statsig added users into each group and the total number of users exposed. You can also confirm whether the target ratio matches what you configured in experiment setup.
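As a rough illustration of confirming the target ratio, the sketch below runs a chi-square goodness-of-fit check on exposure counts for a two-group experiment. The counts, the `srm_p_value` helper, and the 50/50 split are hypothetical, and this is not necessarily the check Statsig itself runs.

```python
import math

def srm_p_value(observed, target_ratios):
    """Chi-square goodness-of-fit test for a two-group split (1 degree of freedom).

    observed: exposure counts per group, e.g. [control, test]
    target_ratios: configured allocation, e.g. [0.5, 0.5]
    """
    if len(observed) != 2 or len(target_ratios) != 2:
        raise ValueError("this sketch only handles two groups")
    total = sum(observed)
    ratio_sum = sum(target_ratios)
    expected = [total * r / ratio_sum for r in target_ratios]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical exposure counts read off the Exposures chart.
chi2, p_value = srm_p_value([50_400, 49_600], [0.5, 0.5])
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```

A small p-value suggests the observed split differs from the configured ratio by more than chance alone would explain, which is worth investigating before reading the Scorecard.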

Scorecard

The experiment Scorecard shows the metric lifts for all Primary and Secondary metrics you set up at experiment creation.

Immediately post-experiment start

For up to the first 24 hours after you start your experiment (before the daily metric results run), Statsig calculates the Scorecard section hourly. This applies to Statsig Cloud only; for Warehouse Native (WHN) projects, you must reload results on demand or set up a daily schedule. This near-real-time Scorecard lets you confirm that Statsig calculates exposures and metrics as expected, and debug your experiment or gate setup if needed.

Do not make experiment decisions based on real-time results data in this first 24-hour window after experiment start. Make decisions only after the experiment reaches its target duration, which is the point at which your primary metrics reach experimental power. For more, go to the target duration documentation.

Because data in this early window after experiment start supports diagnostics rather than decision-making, there are a few key differences from the results shown after daily runs begin:

Metric lifts don't have confidence intervals
No time-series view of metric trends
No projected topline impact analysis
No option to apply more advanced statistical tactics, such as CUPED or Sequential Testing

All of these are available in daily Results, which start showing in the next daily run.

Post-first day scorecard

Figure: Experiment scorecard table displaying metric lifts and confidence intervals

The experiment Results daily run calculates the difference between comparable randomization groups (for example, test and control) across your organization's suite of metrics. The daily run then applies a statistical test to the results. For more about Statsig's stats engine, go to the stats engine documentation.

For every metric, Statsig shows:

The calculated relative difference (Delta %)
The confidence interval
Whether the result is statistically significant, indicated by color:
- Statistically significant positive lifts are green
- Statistically significant negative lifts are red
- Non-significant results are grey

The formula for calculating lift is:

Delta(%) = (Test - Control) / Control

Statsig reports confidence intervals at the selected significance level (95% by default). In a typical two-sided Z-test at the 95% level, the confidence interval is the observed delta +/- 1.96 * standard error.
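To make the arithmetic concrete, here is a minimal sketch that computes Delta(%) and a 95% interval from per-group summary statistics. The `lift_with_interval` helper and the input numbers are hypothetical. The sketch also makes a simplifying assumption: it scales the standard error of the absolute difference by the control mean and treats that mean as a fixed constant. Statsig's stats engine may compute the variance of a relative delta differently.

```python
import math

def lift_with_interval(test_mean, test_var, test_n,
                       control_mean, control_var, control_n, z=1.96):
    """Relative lift and a two-sided interval from group summary statistics.

    Assumes a positive control mean and treats it as a constant when
    converting the absolute standard error into relative terms.
    """
    delta_abs = test_mean - control_mean
    # Standard error of the difference between two independent means.
    se_abs = math.sqrt(test_var / test_n + control_var / control_n)
    delta_pct = delta_abs / control_mean
    half_width = z * se_abs / control_mean
    lower, upper = delta_pct - half_width, delta_pct + half_width
    # Significant when the interval excludes zero.
    significant = lower > 0 or upper < 0
    return delta_pct, (lower, upper), significant

# Hypothetical per-user metric summaries for each group.
delta_pct, (lower, upper), significant = lift_with_interval(
    test_mean=10.5, test_var=25.0, test_n=10_000,
    control_mean=10.0, control_var=25.0, control_n=10_000,
)
print(f"Delta: {delta_pct:.2%}, CI: [{lower:.2%}, {upper:.2%}], significant: {significant}")
```

In this example the interval sits entirely above zero, so the lift would appear as a significant positive result.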

Statsig automatically applies 99.9% winsorization to event_count, event_count_custom, and sum metrics. Winsorization caps extreme outlier values to reduce their impact on experiment results. For metrics added to the Scorecard or Monitoring Metrics sections of your experiment or gate, you can also apply optional statistical treatments. Examples include CUPED (pre-experiment bias reduction) and confidence intervals adjusted for sequential testing. For more details, go to the stats engine documentation.
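The following sketch shows the general idea of winsorization at the 99.9th percentile. It is an illustration only: the `winsorize_upper` helper is hypothetical, it caps only the upper tail, and it uses a nearest-rank percentile. All three are assumptions rather than a description of Statsig's implementation.

```python
import math

def winsorize_upper(values, percentile=99.9):
    """Cap values above the given percentile (nearest-rank definition)."""
    ordered = sorted(values)
    # Round before taking the ceiling so floating-point noise
    # (e.g. 999.0000000000001) does not bump the rank up by one.
    rank = max(1, math.ceil(round(percentile / 100 * len(ordered), 9)))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values], cap

# Hypothetical per-user sums: 999 ordinary users and one extreme outlier.
values = list(range(1, 1000)) + [1_000_000]
capped, cap = winsorize_upper(values)

mean_before = sum(values) / len(values)
mean_after = sum(capped) / len(capped)
print(f"cap={cap}, mean before={mean_before}, mean after={mean_after}")
```

The single outlier dominates the raw mean, while the capped mean stays close to the typical user's value.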

By default, Statsig computes experiment results for only the first 90 days of your experiment. You receive an email notification as you approach the 90-day limit, and you can then extend the compute window 30 days at a time. If the experiment runs beyond the compute window, Statsig stops adding new users to the experiment's results. Analysis for users who are already exposed continues until you make a decision on the experiment.

The compute window affects only whether Statsig includes a user in the experiment's analysis. It doesn't affect the treatment each user receives: new users still receive the experience for the group Statsig randomizes them into.

Experiment results views

Statsig offers the following views for Scorecard metric lifts:

Cumulative results (default view): Displays the aggregate difference between experiment groups and visualizes the corresponding confidence intervals.
Table view: Displays the same data as the cumulative view but in a table format with additional fields.
Daily results: Shows the difference between experiment groups aggregated based on days since start of experiment.
Days since exposure: Shows the difference between experiment groups aggregated based on days since exposure to the experiment.
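The difference between the Daily results and Days since exposure views comes down to how each event is bucketed. The sketch below uses made-up users, dates, and values for a single group to show the two bucketing rules. The data layout and the `bucket_totals` helper are hypothetical and say nothing about how Statsig stores or computes results.

```python
from collections import defaultdict
from datetime import date

experiment_start = date(2024, 1, 1)

# Hypothetical first-exposure dates and metric events (user, day, value).
first_exposure = {"u1": date(2024, 1, 1), "u2": date(2024, 1, 3)}
events = [
    ("u1", date(2024, 1, 1), 2.0),
    ("u1", date(2024, 1, 3), 1.0),
    ("u2", date(2024, 1, 3), 4.0),
    ("u2", date(2024, 1, 4), 3.0),
]

def bucket_totals(events, day_index):
    """Sum event values into buckets chosen by day_index(user, day)."""
    totals = defaultdict(float)
    for user, day, value in events:
        totals[day_index(user, day)] += value
    return dict(totals)

# Daily results: bucket by calendar days since the experiment started.
daily = bucket_totals(events, lambda user, day: (day - experiment_start).days)

# Days since exposure: bucket by days since each user's own first exposure.
since_exposure = bucket_totals(
    events, lambda user, day: (day - first_exposure[user]).days
)

print(daily)
print(since_exposure)
```

The same four events land in different buckets: the calendar view lines users up by date, while the exposure view lines them up by how long each user has been in the experiment.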

The Cumulative results view includes a detailed view on hover that shows the raw statistics used in the metric lift calculations, as well as topline impact.

Figure: Cumulative results view with hover details

Dimensions

There are two ways to break down a Scorecard metric: by a User Dimension or by an Event Dimension.

User dimensions

User Dimensions refer to user-level attributes that are either part of the user object you log, or additional metadata that Statsig extracts. Examples of these user attributes include operating system, country, and region.

You can create custom "explore" queries to filter on or group by available user dimensions. For example, view results for users in the US, or results for iOS users grouped by country. Go to the "explore" tab to create a custom query.

Event dimensions

Event Dimensions refer to the value or metadata logged as part of a custom event used to define the metric. To view results for a metric broken down by categories specific to that metric, specify the dimension in the value or metadata attributes when you log the source event. For example, when you log a "click" event on your web or mobile application, you can log the target category using the value attribute. Statsig automatically generates results for each category in addition to the top-level metric.

To see breakdowns for all categories within a metric, click on the (+) sign next to the metric.

Significance level settings

You can adjust the following settings at any time to view Scorecard results at different significance levels.

Apply Benjamini-Hochberg Procedure per Variant: Select this option to reduce the probability of false positives by adjusting the significance level for multiple comparisons. Go to Benjamini-Hochberg Procedure for details.
Confidence Interval: Changes the confidence level of the intervals displayed with the metric deltas. Choose a lower confidence level (for example, 80%) when you have higher tolerance for false positives and prefer fast iteration with directional results over longer experiments with greater certainty.
CUPED: Toggle CUPED on or off using the inline settings above the metric lifts. This setting applies only to Scorecard metrics; Statsig doesn't apply CUPED to non-Scorecard metrics.
Sequential Testing: Applies a correction to p-values and confidence intervals to reduce false positive rates when evaluating results before the target completion date of the experiment. This mitigates the increased false positive rate associated with the "peeking problem". Toggle Sequential Testing on or off using the inline settings above the metric lifts. This setting works only for experiments with a set target duration.
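For readers unfamiliar with the Benjamini-Hochberg procedure named above, here is a minimal sketch of the standard step-up algorithm applied to a list of p-values. The `benjamini_hochberg` helper and the p-values are hypothetical. The sketch does not model how Statsig groups comparisons per variant or how it displays the adjusted significance level.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a reject/keep flag per p-value using the BH step-up procedure."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected

# Hypothetical p-values for five metrics compared against control.
p_values = [0.2, 0.001, 0.039, 0.012, 0.035]
rejected = benjamini_hochberg(p_values, alpha=0.05)
print(rejected)
```

Note that 0.035 is rejected even though it exceeds its own rank threshold (3/5 * 0.05), because a larger p-value (0.039) passes at rank 4. That is the step-up behavior of the procedure.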

Restarting results

If your experiment has stopped computing results, you can resume updates by selecting the Restart button. Before restarting, review the following:

A Restart isn't a Reset. A Restart doesn't re-randomize units in your experiment, and all users continue to receive the same group assignments.
Statsig begins computing experiment results from the restart point, so metric results start over. Old results may still appear in time series and explore query views, but Statsig doesn't carry them forward or update them.
The Cumulative Exposures chart updates based on new exposures, but the duration of the pause in computations affects whether the chart starts from zero or retains past exposure counts.

To avoid needing a restart, extend your experiments while they are still running. Monitor email alerts from Statsig and check your experiments regularly.
