Holdouts

Measure the cumulative impact of multiple features with holdouts, including how Holdout Pulse compares held-out users against a balanced non-holdout group.

Holdouts measure the aggregate impact of multiple features. A holdout keeps a group of users back from a set of features so that their behavior can serve as a baseline. An individual A/B test or experiment compares control and test groups for a single feature. A holdout instead compares the holdout group (Control) against a balanced group of users who weren't held out. This balanced group continues through the normal rollout or experiment behavior for the included features.

How to use holdouts

In the Statsig console, navigate to the Holdouts section (a holdout is a specialized kind of experiment) and click Create New.
Enter the name, description, and unit type for the holdout.
Choose a global or selected holdout. Statsig automatically adds a global holdout to any new feature with the same unit type. A global holdout captures the aggregate impact of all features developed after the holdout began, and you can opt individual features out as needed. A selected holdout captures the aggregate impact of a specific set of features.
By default, holdouts apply to a percentage of all users (Population = Everyone). To target a subset of users, apply a Targeting Gate (Population = Targeting Gate). For example, to create an iOS-only holdout, apply a Targeting Gate that passes only iOS users.
Set the holdout percentage between 1% and 10%. Statsig recommends a small holdout percentage to limit the number of users who don’t see new features.

How to read holdouts

Holdouts use the same “equal variant” methodology as Feature Gate rollouts: Statsig computes holdout lift by comparing equal-sized groups. See “A/B Testing Intuition Busters: Common Misunderstandings in Online Controlled Experiments” by Ron Kohavi, Alex Deng, and Lukas Vermeer for more on the advantages of this methodology.

Accordingly, the Cumulative Exposures panel for a given Holdout shows total exposures of the Holdout, broken down into three groups:

In holdout (Control): Units that Statsig included in the Holdout and used for analysis.
Not in holdout (Test); used for analysis: Units that Statsig didn't include in the Holdout but selected for comparison against the holdout group.
Not in holdout (Test); not used for analysis: Units that Statsig didn't include in the Holdout and didn't use in the lift calculation.

For units not included in the Holdout, Statsig generates the two "Not in holdout" groups using random sampling. Statsig sizes the group used for analysis to balance the comparison against the holdout group.
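The three groups above can be illustrated with a short sketch. It assumes the analysis group is drawn to match the holdout group's size, and the function name, seed, and user IDs are hypothetical. It does not reproduce Statsig's actual sampling logic.

```python
import random

def split_for_analysis(unit_ids, holdout_ids, seed=0):
    """Partition units into holdout, sampled-for-analysis, and unused groups.

    Illustrative only: assumes the comparison group is sized to equal the
    holdout group, which is one way to "balance" the comparison.
    """
    holdout = set(holdout_ids)
    # Sort so the random sample is reproducible regardless of input order.
    not_held_out = sorted(u for u in unit_ids if u not in holdout)
    rng = random.Random(seed)
    k = min(len(holdout), len(not_held_out))
    used = set(rng.sample(not_held_out, k))
    unused = set(not_held_out) - used
    return holdout, used, unused

units = [f"user_{i}" for i in range(1000)]
held_out = units[:10]  # a 1% holdout, picked here only for illustration
control, test_used, test_unused = split_for_analysis(units, held_out)
print(len(control), len(test_used), len(test_unused))  # 10 10 980
```

With a 1% holdout, most non-holdout units land in the "not used for analysis" group, which is why that group dominates the Cumulative Exposures panel.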

Holdout metric lifts represent the cumulative impact of launched and active experiments, measured relative to the Holdout group. Statsig compares the Holdout group against the same percentage of the rest of the population, which continues through normal behavior for the included rollouts and experiments. The "Not in holdout (Test); used for analysis" group isn't necessarily made up of users who saw every treatment. Those users follow the normal non-holdout behavior for each included gate or experiment.

In the example below, the 1% Holdout compares metric values of users in Holdout vs. 1% of users not in Holdout. The comparison doesn't include the full remaining 99% of users. The launched features are having an overall negative effect on the "Add to Cart" metric.

Holdout pulse results showing metric lift comparison between holdout and exposed users
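As a rough illustration of what a negative lift means, the sketch below computes a relative lift from per-user metric values. The function name and the sample numbers are made up for this example. Statsig's own results also include confidence intervals and other statistics that this sketch omits.

```python
from statistics import mean

def relative_lift(control_values, test_values):
    """Relative difference of the test mean versus the control (holdout) mean."""
    control_mean = mean(control_values)
    if control_mean == 0:
        raise ValueError("control mean is zero; relative lift is undefined")
    return (mean(test_values) - control_mean) / control_mean

# Hypothetical per-user "Add to Cart" counts, not real data.
holdout_add_to_cart = [2, 3, 1, 2, 2]      # Control: users in the holdout
non_holdout_add_to_cart = [2, 1, 2, 2, 2]  # Test: sampled users not in the holdout

lift = relative_lift(holdout_add_to_cart, non_holdout_add_to_cart)
print(f"{lift:+.1%}")  # -10.0%
```

A negative value here means the users who received the normal feature behavior added fewer items to their carts than the held-out users did.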

Best practices

Size: Statsig recommends a low single-digit holdout percentage, such as 1%–2%, to limit the number of customers who don't see new features.
Duration: Statsig recommends operating holdouts for three to six months, then releasing the holdout. Prolonging the holdout period may increase software complexity, because you must maintain a functioning product with no new features for a longer period.
Back testing: Occasionally you may want to turn off a set of already-released features to measure their effectiveness. Statsig doesn't recommend this approach because it turns off features that users are already using. However, when a "back measurement" is critical, you can use Holdouts to turn off a set of features and automatically compute their impact.

Unit ID types

By default, holdouts use User ID. To use a different ID type, select it from the drop-down menu during holdout creation.

You can apply holdouts only to Experiments and Feature Gates that use the same randomization unit. If a team plans to run experiments on both User ID and Stable ID, you need two separate holdouts to evaluate the cumulative impact of each type of experiment.
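The unit-type rule can be sketched as a simple grouping: each distinct randomization unit among the features you want to measure needs its own holdout. The feature names and the `userID` and `stableID` labels below are hypothetical placeholders.

```python
from collections import defaultdict

def group_by_unit_type(entities):
    """Group gate/experiment names by randomization unit.

    Each key in the result would need its own holdout, because a holdout
    only applies to entities that share its unit type.
    """
    by_unit = defaultdict(list)
    for name, unit_type in entities:
        by_unit[unit_type].append(name)
    return dict(by_unit)

# Hypothetical planned features and their unit types.
planned = [
    ("new_checkout", "userID"),
    ("logged_out_banner", "stableID"),
    ("search_ranking", "userID"),
]
groups = group_by_unit_type(planned)
print(len(groups))  # 2
```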

Holdout effects on gate and experiment SDK methods

Feature flags/gates

For users in holdout, gates always return False.

Experiments

For users in holdout, if the experiment isn't in a Layer, calls to get experiment parameters always return the "default value" passed in code.
For users in holdout, if the experiment is in a Layer, calls to get experiment parameters return the values defined in the Layer defaults in the Statsig console.

When you ship an experiment in a layer, Statsig normally updates the layer defaults. However, users in the holdout don't see those defaults. Instead, the layer has a separate set of default parameters that applies only to held-out users.
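The rules in this section can be summarized as a small model. This is not the Statsig SDK: the `Layer` class, its `holdout_defaults` field, and both functions are hypothetical names used only to restate the documented behavior in code. Falling back to the code default when the layer has no value for a parameter is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class Layer:
    # Hypothetical stand-in for the layer defaults that apply to held-out users.
    holdout_defaults: Dict[str, Any] = field(default_factory=dict)

def gate_value_for_held_out_user() -> bool:
    """Gates always return False for users in the holdout."""
    return False

def experiment_param_for_held_out_user(
    param: str, code_default: Any, layer: Optional[Layer] = None
) -> Any:
    """Resolve an experiment parameter for a user in the holdout."""
    if layer is None:
        # Experiment is not in a layer: the default passed in code is returned.
        return code_default
    # Experiment is in a layer: the layer's defaults for held-out users apply.
    # Falling back to code_default for a missing parameter is an assumption.
    return layer.holdout_defaults.get(param, code_default)

layer = Layer(holdout_defaults={"button_color": "blue"})
print(gate_value_for_held_out_user())                                    # False
print(experiment_param_for_held_out_user("button_color", "gray"))        # gray
print(experiment_param_for_held_out_user("button_color", "gray", layer)) # blue
```

In practice this means application code should pass sensible defaults to every parameter lookup, because held-out users depend on them.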

Ending a holdout

To end a holdout and allow users in the holdout group to see all held-out features, disable the holdout. Disabling the holdout stops tracking the effects of those features, but Statsig retains the results for future reference.

Alternatively, delete the holdout if you created it by mistake or if you no longer need to keep the results.
