Frequentist Sequential Testing

Learn how sequential testing addresses the peeking problem in A/B tests and enables early decision making with statistical rigor.

What's the problem with looking early in a "standard" A/B test

Traditional A/B testing best practices (t-tests, z-tests, etc.) require that the readout of experiment metrics occur only once, when the target sample size is reached (that is, when your design duration has been reached and you have the sample size you need). This approach is called a "Fixed Horizon Test": when designing an experiment, you set the number of units you want to observe and commit to analyzing results only after the dataset is complete.

Continuous experiment monitoring ("peeking") for the purpose of decision making results in inflated false positive rates (the peeking problem), which can be much higher than the expected rate at your desired significance level.

How peeking increases decision error rates

Continuous monitoring inflates false positives because every time you consider ending an experiment early, you risk drawing an incorrect conclusion. At the core of a standard hypothesis test, you decide whether to fail to reject the null hypothesis or to reject it in favor of the alternative. Each early look that allows an early decision is another opportunity to reject the null hypothesis when it is actually true, and these opportunities accumulate across looks.

Why early results can be misleading

Metric values and p-values always fluctuate to some extent due to noise during any experiment. Results can move into and out of statistical significance because of this noise, even when there is no real underlying effect. Noisy fluctuations result from random unit assignment and unpredictable user behavior, and can't be eliminated entirely. Noise levels vary by test, depending on what you are testing and who your users are. Tests also vary over time in the amount of noise they see, as adding more users and observing them longer tends to help random fluctuations even out.

Peeking introduces selection bias when it causes an experimenter to adjust the readout date. When an experimenter makes any early decision about results (for example, "is the result stat-sig; can we ship a variant early?") the chances increase that the decision is based on a temporary snapshot of results that are always fluctuating. The experimenter is potentially selecting a stat-sig result that wouldn't appear if the data were analyzed only once at the full, pre-determined completion of the experiment. In frequentist A/B test procedures, early decisions can only increase the false positive rate (declaring an experimental effect when there is none), even when the intention is to make a less-biased decision.
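The effect of repeated looks can be illustrated with a small simulation. The sketch below is not Statsig's code; it assumes an A/A test (no true effect), equally sized batches of data arriving between looks, and a known variance, so the cumulative z-statistic after `k` looks behaves like a sum of `k` independent standard normal draws divided by `sqrt(k)`. The function name and its parameters are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_aa_false_positive_rates(n_sims=2000, n_looks=20, alpha=0.05, seed=0):
    """Simulate A/A tests (no true effect) and compare two decision rules.

    Returns (peeking_rate, fixed_rate):
      peeking_rate: share of simulations that were stat-sig at ANY look
      fixed_rate:   share that were stat-sig at the final look only
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        crossed_at_any_look = False
        z = 0.0
        for k in range(1, n_looks + 1):
            # Each look adds one batch; under the null the standardized
            # batch effect is an independent N(0, 1) draw.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / k ** 0.5  # cumulative z-statistic at look k
            if abs(z) > z_crit:
                crossed_at_any_look = True
        peeking_hits += crossed_at_any_look
        fixed_hits += abs(z) > z_crit
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking_fpr, fixed_fpr = simulate_aa_false_positive_rates()
print(f"stat-sig at any of the looks: {peeking_fpr:.3f}")
print(f"stat-sig at the final readout: {fixed_fpr:.3f}")
```

Because a result that is stat-sig at the final look is also counted under the "any look" rule, the peeking rate can never be lower than the fixed-horizon rate; the simulation shows how much higher it gets when every look is treated as a possible stopping point.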

How Sequential Testing works for an A/B test

In Statsig's implementation of Sequential Testing, Statsig automatically adjusts p-values and confidence intervals for each preliminary analysis window to compensate for the increased false positive rate associated with peeking, as shown in the Results tab:

Sequential testing results visualization

In this example, the confidence intervals for each metric are expanded using the "wings" or "tabs". This is a quick visual indicator that sequential testing is enabled and shows how much the intervals have been expanded.

Statsig results table highlighting sequential testing adjusted confidence interval

In this example, the sequential testing adjustment determines whether the indicated result is declared stat-sig.

The goal of Sequential Testing is to enable early decision making when observations are sufficiently strong to outweigh random fluctuations, while limiting the risk of false positives. Although peeking is typically discouraged, regular monitoring with sequential testing is valuable in some cases:

  • Unexpected regressions: When experiments have bugs or unintended consequences that severely impact key metrics, sequential testing helps identify these regressions early and distinguishes significant effects from random fluctuations.
  • Opportunity cost: When a significant loss may result from delaying the experiment decision (such as launching a new feature ahead of a major event or fixing a bug), sequential testing can support an early decision if key metrics show improvement. Use caution: an early stat-sig result for certain metrics doesn't guarantee sufficient power to detect regressions in other metrics. Limit this approach to cases where only a small number of metrics are relevant to the decision.
Sequential testing can be used anywhere you do an experimental analysis. This includes your main experimental Results page as well as any custom queries.

Quick guides

Enable sequential testing results

In the Setup tab of your experiment, with Frequentist selected as your Analytics Type, enable Sequential Testing under the Analysis Settings section. You can toggle this setting at any time during the life of the experiment and don't need to enable it before the experiment starts.

Sequential testing configuration interface

Interpreting sequential testing results

Click Edit at the top of the metrics section in Pulse to toggle Sequential Testing on/off.

Pulse metrics sequential testing toggle

When enabled, Statsig applies an adjustment to results calculated before the target completion date of the experiment.

Sequential testing confidence interval visualization

The dashed line represents the expanded confidence interval resulting from the adjustment. The solid bar is the standard confidence interval computed without any adjustments. If the adjusted confidence interval overlaps with zero, the metric delta isn't stat-sig at the moment, and the experiment should continue as planned.

Sequential testing is a reliable way to make an early decision, particularly for early detection of regressions. Early decision-making often results in underpowered lift estimates with a high degree of uncertainty. If making the correct decision is important, use statistically significant sequential testing results. If accurate measurement is important, wait for full power as estimated by your pre-experimental power calculation. Statsig doesn't calculate statistical power on post-hoc experimental results (refer to section "Post-hoc Power Calculations are Noisy and Misleading" in Kohavi, Deng, and Vermeer, A/B Testing Intuition Busters).

How Statsig implements Sequential Testing

Two-Sided Tests

Confidence Intervals

Statsig uses mSPRT based on the approach proposed by Zhao et al. in this paper. The two-sided Sequential Testing confidence interval with significance level $\alpha$ is given by:

$$

CI^*(\Delta \overline{X}) = \Delta \overline{X} \pm Z^*_{\alpha/2} \cdot \sqrt{V} $$

where

  • $Z^*_{\alpha/2}$ is the z-critical value, modified for sequential testing:

$$

Z^*_{\alpha/2} = \sqrt{\frac{(V+\tau)}{\tau}\left(-2\ln(\alpha/2)-\ln(\frac{V}{V+\tau})\right)} $$

  • $V$ is the variance of the difference in means, as used in the standard (non-sequential) test. It can be estimated from the sample variances and sizes of the test and control groups:

$$

V = var(\Delta \overline X) = var(\overline X_t) + var(\overline X_c) = \frac{var(X_t)}{N_t} + \frac{var(X_c)}{N_c} $$

  • $\tau$ is the mixing parameter given by:

$$

\tau =(Z_{\alpha/2})^2\cdot\frac{var(X_t)+var(X_c)}{N_t+N_c} $$

  • $Z_{\alpha/2}$ is the z-critical value used in the non-sequential test, for the desired significance level (1.96 for the standard $\alpha = 0.05$)
Statsig has validated that this parameter satisfies the expected False Positive Rate and provides enough power to detect large effects early. More details on this analysis are available here.
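As a minimal sketch, the formulas above can be transcribed directly into Python using only the standard library. This is an illustration of the equations as written on this page, not Statsig's implementation; the function name, argument names, and the input numbers in the usage example are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def sequential_ci_two_sided(delta, var_t, var_c, n_t, n_c, alpha=0.05):
    """Return (lower, upper, z_star) for the two-sided sequential CI.

    delta        observed difference in means (test minus control)
    var_t, var_c sample variances of the test and control groups
    n_t, n_c     number of units in each group
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)            # Z_{alpha/2}
    v = var_t / n_t + var_c / n_c                      # V
    tau = z ** 2 * (var_t + var_c) / (n_t + n_c)       # mixing parameter
    z_star = sqrt((v + tau) / tau * (-2 * log(alpha / 2) - log(v / (v + tau))))
    half_width = z_star * sqrt(v)
    return delta - half_width, delta + half_width, z_star


# Made-up inputs for illustration only.
lower, upper, z_star = sequential_ci_two_sided(
    delta=0.1, var_t=1.0, var_c=1.0, n_t=1000, n_c=1000
)
print(f"Z* = {z_star:.3f}, adjusted CI = [{lower:.4f}, {upper:.4f}]")
```

With these made-up inputs, the unadjusted 95% interval would exclude zero, while the adjusted interval still overlaps zero, so the result would not be declared stat-sig under sequential testing.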

p-Values

Statsig modifies the p-value calculation so that sequential testing p-values are consistent with the expanded confidence intervals above.

The goal is to evaluate the mSPRT test so that the Type I error remains approximately equal to $\alpha$, and so that the sequential testing p-value is consistent with the expanded confidence interval. That is, a CI that includes 0.0% should have a p-value ≥ $\alpha$, and one that excludes 0.0% should have a p-value < $\alpha$.

The observed z-statistic (z-score) remains unchanged. Instead of evaluating $Z$ against the standard normal distribution $N(0, 1)$, Statsig evaluates it against a normal distribution $N(0, \sigma^2)$ with mean zero and standard deviation $\sigma$. For a two-sided test, $\sigma$ is chosen so that this distribution assigns probability $\alpha$ to $|Z|$ exceeding $Z^*_{\alpha/2}$. Solving for $\sigma$ gives:

$$

\sigma=\frac{Z_{\alpha/2}^*}{\sqrt{2} \cdot erf^{-1}(1-\alpha)} $$

where $erf^{-1}$ is the inverse error function.

From here we can compute the two-sided sequential testing p-value as:

$$

\text{p-value}^* = 2 \cdot \frac{1}{\sqrt{2\pi}} \int \limits _{-\infty}^{-|Z|} \frac{e^{- \frac{t^2}{{2\sigma^2}}}}{\sigma}dt $$

where $Z$ is the observed z-statistic (i.e. z-score) as usual.
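The two-sided p-value can be sketched in a few lines of Python. Since $\sqrt{2} \cdot erf^{-1}(1-\alpha)$ equals the standard critical value $Z_{\alpha/2}$, the sketch computes $\sigma$ as $Z^*_{\alpha/2} / Z_{\alpha/2}$ using `statistics.NormalDist` rather than an inverse error function, and the integral reduces to twice the CDF of $N(0, \sigma^2)$ evaluated at $-|Z|$. The function name and the example values are hypothetical, and this is an illustration of the formulas above rather than Statsig's code.

```python
from statistics import NormalDist


def sequential_p_value_two_sided(z_obs, z_star, alpha=0.05):
    """Two-sided sequential p-value for an observed z-statistic.

    z_obs   observed z-statistic (unchanged by sequential testing)
    z_star  sequential critical value Z*_{alpha/2}
    """
    # Z_{alpha/2}; identical to sqrt(2) * erfinv(1 - alpha)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    sigma = z_star / z_crit
    # 2 * P(T <= -|Z|) for T ~ N(0, sigma^2)
    return 2 * NormalDist(0.0, sigma).cdf(-abs(z_obs))


# Hypothetical critical value for illustration.
z_star = 3.58
print(sequential_p_value_two_sided(2.5, z_star))     # |Z| below Z*: p-value above alpha
print(sequential_p_value_two_sided(z_star, z_star))  # |Z| equal to Z*: p-value equals alpha
print(sequential_p_value_two_sided(4.0, z_star))     # |Z| above Z*: p-value below alpha
```

The middle call shows the consistency property described above: a z-statistic sitting exactly on the adjusted critical value yields a p-value of $\alpha$, so the p-value and the expanded confidence interval always agree on significance.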

One-Sided Tests

Statsig modifies each step for one-sided sequential testing.

$$

CI^*(\Delta \overline{X}) = \begin{cases} \left[\Delta \overline{X} - Z^*_{\alpha} \cdot \sqrt{V}, \quad +\infty \right) & \text{if right-sided test} \\ \\ \left(- \infty, \quad \Delta \overline{X} + Z^*_{\alpha} \cdot \sqrt{V} \right] & \text{if left-sided test} \end{cases} $$

$$

\text{p-value}^* = \begin{cases} 1 - \frac{1}{\sqrt{2\pi}} \int \limits _{-\infty}^{Z} \frac{e^{- \frac{t^2}{{2\sigma^2}}}}{\sigma}dt & \text{if right-sided test} \\ \\ \frac{1}{\sqrt{2\pi}} \int \limits _{-\infty}^{Z} \frac{e^{- \frac{t^2}{{2\sigma^2}}}}{\sigma}dt & \text{if left-sided test} \end{cases} $$

where

  • $Z^*_{\alpha}$ is the one-sided test z-critical value, modified for sequential testing:

$$

Z^*_{\alpha} = \sqrt{\frac{(V+\tau)}{\tau}\left(-2\ln(\alpha)-\ln(\frac{V}{V+\tau})\right)} $$

  • $V$ is the same as for two-sided tests.

  • $\tau$ is the mixing parameter given by:

$$

\tau =(Z_{\alpha})^2\cdot\frac{var(X_t)+var(X_c)}{N_t+N_c} $$

  • $Z_{\alpha}$ is the one-sided z-critical value used in the non-sequential test, for the desired significance level (1.645 for the standard $\alpha = 0.05$)

  • $\sigma$ is solved via:

$$

\sigma = \begin{cases} \frac{Z_{\alpha}^*}{\sqrt{2} \cdot erf^{-1}(1 - 2 \alpha)} & \text{if right-sided test} \\ \\ \frac{- Z_{\alpha}^*}{\sqrt{2} \cdot erf^{-1}(2 \alpha - 1)} & \text{if left-sided test} \end{cases} $$

  • $Z$ is the (signed) observed z-statistic as usual (i.e. z-score)
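The one-sided case can be sketched the same way. The code below transcribes the one-sided formulas above using only the Python standard library; it is an illustration, not Statsig's implementation. Two points are assumptions: the observed z-statistic is taken to be $\Delta \overline{X} / \sqrt{V}$, and $\sigma$ is computed as $Z^*_{\alpha} / Z_{\alpha}$, which both cases of the $\sigma$ formula reduce to because $erf^{-1}$ is an odd function and $\sqrt{2} \cdot erf^{-1}(1 - 2\alpha) = Z_{\alpha}$. The function and argument names are hypothetical.

```python
from math import inf, log, sqrt
from statistics import NormalDist


def sequential_one_sided(delta, var_t, var_c, n_t, n_c, alpha=0.05, side="right"):
    """Return (ci, p_value, z_star) for a one-sided sequential test.

    side="right" tests for an increase; side="left" tests for a decrease.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)           # Z_alpha (1.645 at 0.05)
    v = var_t / n_t + var_c / n_c                       # V
    tau = z_alpha ** 2 * (var_t + var_c) / (n_t + n_c)  # mixing parameter
    z_star = sqrt((v + tau) / tau * (-2 * log(alpha) - log(v / (v + tau))))
    sigma = z_star / z_alpha   # same value for right- and left-sided tests
    z_obs = delta / sqrt(v)    # assumption: the usual signed z-statistic
    scaled = NormalDist(0.0, sigma)
    if side == "right":
        ci = (delta - z_star * sqrt(v), inf)
        p_value = 1 - scaled.cdf(z_obs)
    elif side == "left":
        ci = (-inf, delta + z_star * sqrt(v))
        p_value = scaled.cdf(z_obs)
    else:
        raise ValueError("side must be 'right' or 'left'")
    return ci, p_value, z_star


# Made-up inputs for illustration only.
ci, p_value, z_star = sequential_one_sided(0.1, 1.0, 1.0, 1000, 1000, side="right")
print(f"Z* = {z_star:.3f}, CI = {ci}, p-value = {p_value:.4f}")
```

As in the two-sided case, an observed z-statistic exactly equal to the adjusted critical value (with the appropriate sign) produces a p-value of $\alpha$ and a confidence bound that touches zero.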
