Stratified Sampling

Learn how stratified sampling reduces variance and improves experiment reliability in low volume or high variance scenarios.

What is stratified sampling

Stratified sampling involves dividing the entire population into homogeneous groups called strata (plural for stratum). Statsig then selects random samples from each stratum. For example, if you had XS and XL customers and randomized them into two groups (Control and Test), you want both Control and Test balanced across XS and XL customers. You can also stratify based on a metric such as Revenue/User.

With large numbers, randomization typically solves this balance problem. However, in B2B scenarios and other relatively low-volume or high-variance scenarios, stratified sampling ensures this balance. Statsig supports both automated and manual stratified sampling. In tests where a tail-end of power users drive a large portion of an overall metric value, stratified sampling meaningfully reduces false positive rates and makes results more consistent. In Statsig's simulations, this approach produced around a 50% decrease in the variance of reported results.

Automated stratified sampling

How it works

The Statsig SDKs use a salt to randomize or bucket experiment subjects (learn more). When you enable stratified sampling, Statsig tries n different salts (100 by default) and evaluates how balanced your groups are. Statsig evaluates this balance based on either a metric you select or an attribute you provide describing your experiment subjects, then picks and saves the best salt. Learn more.

The selection space for the salts is large enough that stratifying multiple experiments on the same metric doesn't result in overlap. In Statsig's simulations, the groups were as independent as expected, which matched the literature on this topic.

Enabling stratified sampling

You can enable stratified sampling on an experiment under Advanced Settings on the experiment setup page. There are two ways to stratify on Statsig. If you choose a metric to stratify on, Statsig uses that metric to balance the groups.

Stratified sampling metric selection interface

If you instead choose an attribute or a classification (for example, S, M, L, XL), Statsig uses that to balance the groups.

On Statsig Cloud, you upload a CSV (in Early Access).
On Statsig Warehouse Native, you use Entity Properties.

After you select the Stratify button, Statsig analyzes a set of salts and picks the best one.

Stratification analysis results interface

FAQ and best practices

What population does Statsig use when balancing?
- When evaluating salts, Statsig computes balance using pre-experiment data for the entire targeted population of the experiment’s unit type (for example, all userIDs or all customerIDs) over the selected lookback window. There is no filtering on exposure because the experiment hasn't started yet.
How does Statsig handle new units after stratification?
- The chosen salt still assigns units that weren't present in the pre-experiment data deterministically, i.e., effectively at random with respect to the balancing metric. These new units don't influence the salt selection and may introduce some drift from the initial balance.
Should I use stratified sampling for every experiment?
- Not necessarily. Stratified sampling is most useful when you expect imbalance due to heterogeneous units (for example, “whales”) or skewed metrics. The tradeoff is time and compute cost that scales with the number of units and adds steps before starting an experiment. If you don't expect meaningful imbalance, Statsig generally recommends a standard random split.
Does salt evaluation assume 100% allocation? What about running at less than 100%?
- Yes. Statsig evaluates all candidate salts assuming full (100%) allocation of the targeted population. If you run the experiment at an allocation below 100%, random sampling of that subset can reintroduce imbalance (for example, by chance, some high-impact units may fall disproportionately into one arm). For the period you care most about inference, prefer 100% allocation to preserve the intended balance. Use lower allocations briefly for safe rollouts rather than for the full experiment duration.
Across candidate salts, does Statsig evaluate the same set of users?
- Yes. Statsig assesses candidate salts over the same targeted population; only the randomization the salt induces changes.
How long does stratification take?
- Duration depends on the number of units and the metric/source Statsig queries. There is no fixed SLA; larger populations take longer.

Manual assignment for stratified sampling

When you set up an experiment, you can configure overrides (for example, force user X or Segment A into Control, force user Y or Segment B into Test). Overrides are for testing; Statsig excludes overridden users from experimental analysis in Pulse results. To use manual assignment for stratified sampling, select the Include Overrides in Pulse checkbox. This checkbox includes manually overridden users in each variant in all metric lift analyses. You can configure 100% of experiment participants into your test variants manually, or configure some subset of participants into variants manually and randomly assign the remainder.

While you can add overrides for an ID type that is different than the ID type of the experiment, those ID evaluations don't resolve to the id type of the experiment and don't contribute to pulse results.

When you use the Statsig SDK for assignment, it handles randomization. When you control assignment of users, you're responsible for balancing users across experiment groups.

Manual assignment override configuration

Morgan and Rubin 2012 walks through the history, the philosophy, and the proofs of re-randomization, especially how re-randomization reduces the randomization variance of the difference in means. Note that the paper calls out "Standard asymptotic-based analysis procedures that do not take the re-randomization into account will be statistically conservative". However, to maintain consistent and comparable results across different methods, Statsig stays conservative with the t-test.Lin & Ding 2019 is another interesting read for your reference.

Was this helpful?