Sequential Probability Ratio Tests
What is SPRT?
The Sequential Probability Ratio Test (SPRT) is an alternative, more advanced methodology for running AB tests that differs from the traditional Null Hypothesis Significance Test (commonly called Frequentist analysis). SPRT can meaningfully shorten time to decision for your experiments, including detecting unwanted metric regressions much faster. Results also tend to be easier to share with stakeholders who aren't familiar with p-values and significance levels. Lastly, SPRT has no penalty for peeking: it is built as a sequential test methodology from the start, so there's no need for sequential testing plans, alpha spending, or CI penalties.
Concepts
SPRT introduces a few key concepts that differ from standard Frequentist tests. At its core, SPRT relies on the Likelihood Ratio (LR) and Upper and Lower decision boundaries, A and B.
The Likelihood Ratio compares how likely your observed data is under two competing hypotheses:
- Numerator: The likelihood of what you observe if the alternative hypothesis (which you set) is correct.
- Denominator: The likelihood of what you observe if the null hypothesis is correct.
The Upper and Lower decision boundaries are determined by your joint tolerances for Type I and Type II errors.
- A: If LR exceeds this upper threshold, you should accept the Alternative Hypothesis.
- B: If LR is less than this lower threshold, you should accept the Null Hypothesis.
- When LR falls into the range between these thresholds, no decision can be made and you should continue collecting data.
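As a rough illustration of how error tolerances can translate into decision boundaries, the sketch below uses Wald's classic approximations, A ≈ (1 − β) / α and B ≈ β / (1 − α), where α is the tolerated Type I error rate and β the tolerated Type II error rate. This is an assumption for illustration only: the function names are hypothetical, and Statsig's actual thresholds may be computed differently.

```python
import math


def wald_boundaries(alpha: float, beta: float) -> tuple:
    """Return (upper A, lower B) from Wald's approximations.

    alpha: tolerated Type I error rate (false positive).
    beta:  tolerated Type II error rate (false negative).
    """
    if not (0 < alpha < 1 and 0 < beta < 1):
        raise ValueError("alpha and beta must be strictly between 0 and 1")
    upper = (1 - beta) / alpha   # A: accept the alternative above this
    lower = beta / (1 - alpha)   # B: accept the null below this
    return upper, lower


def decide(lr: float, upper: float, lower: float) -> str:
    """Map a likelihood ratio onto one of the three SPRT outcomes."""
    if lr > upper:
        return "accept alternative"
    if lr < lower:
        return "accept null"
    return "continue"


upper, lower = wald_boundaries(alpha=0.05, beta=0.20)
print(f"A = {upper:.2f}, B = {lower:.3f}")
print(decide(5.8, upper, lower))
```

With these illustrative tolerances, an LR of 5.8 sits between the two boundaries, so the test would keep collecting data.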
An LR of 5.8, for example, indicates that what you observed is 5.8x more likely under the alternative hypothesis than under the null hypothesis.
One of the nice things about SPRT is that this Likelihood Ratio is similar to how most people think about comparing options. Rather than reporting p-values and significance levels, you can now report a result like "With an LR of 3.5, what we observed is 3.5x more likely if the feature worked than if it had no effect."
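To make the likelihood ratio concrete, here is a minimal sketch for the simplest possible case: a single conversion-rate metric with two fully specified hypotheses. The function name, the rates, and the counts are all hypothetical, and this is not the calculation Statsig performs; it only shows what "likelihood under the alternative divided by likelihood under the null" means.

```python
import math


def bernoulli_likelihood_ratio(successes: int, trials: int,
                               p_null: float, p_alt: float) -> float:
    """Likelihood of the data under p_alt divided by its likelihood under p_null.

    Works in log space so that large samples do not underflow.
    """
    failures = trials - successes
    log_lr = (successes * math.log(p_alt / p_null)
              + failures * math.log((1 - p_alt) / (1 - p_null)))
    return math.exp(log_lr)


# Hypothetical numbers: null says a 10% conversion rate, alternative says 12%.
lr = bernoulli_likelihood_ratio(successes=130, trials=1000,
                                p_null=0.10, p_alt=0.12)
print(f"LR = {lr:.1f}")
```

An LR above 1 favors the alternative and an LR below 1 favors the null; how far from 1 it must be before you act is what the decision boundaries encode.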
Why SPRT?
- Faster Decisions: SPRT allows you to reach conclusions more quickly, potentially reducing experiment run time.
- Intuitive Results: Instead of p-values, SPRT uses the Likelihood Ratio, a more intuitive measure of evidence for or against your hypotheses.
- Sequential Analysis: Data is continuously evaluated as it is collected, allowing for early stopping when sufficient evidence is reached. There's no penalty for "peeking" in SPRT experiments.
- Clear Outcomes: SPRT enables you to confidently accept either the Null or Alternative hypothesis, rather than just “rejecting the null.”
- Data-Informed: Statsig’s implementation uses your past data and power analysis to inform the likelihood calculations and decision thresholds.
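The sequential, stop-when-ready behavior described above can be sketched as a loop that updates the likelihood ratio after every observation and checks it against the boundaries. This is a simplified one-sample illustration on simulated conversions, assuming Wald's approximate boundaries; `run_sprt` and all parameter values are hypothetical and do not reflect Statsig's implementation.

```python
import math
import random
from typing import Iterable, Tuple


def run_sprt(observations: Iterable[int], p_null: float, p_alt: float,
             alpha: float = 0.05, beta: float = 0.20) -> Tuple[str, int]:
    """Run a textbook Bernoulli SPRT; return (decision, observations used)."""
    log_upper = math.log((1 - beta) / alpha)   # log of boundary A
    log_lower = math.log(beta / (1 - alpha))   # log of boundary B
    log_lr = 0.0
    n = 0
    for x in observations:
        n += 1
        # Each observation adds its own log-likelihood-ratio contribution.
        if x:
            log_lr += math.log(p_alt / p_null)
        else:
            log_lr += math.log((1 - p_alt) / (1 - p_null))
        # Check the boundaries after every update: peeking is built in.
        if log_lr > log_upper:
            return "accept alternative", n
        if log_lr < log_lower:
            return "accept null", n
    return "continue", n


# Simulated stream with a true conversion rate of 12% (seeded for repeatability).
rng = random.Random(42)
stream = (1 if rng.random() < 0.12 else 0 for _ in range(20000))
decision, n_used = run_sprt(stream, p_null=0.10, p_alt=0.12)
print(decision, n_used)
```

The loop returns as soon as a boundary is crossed, so the number of observations consumed varies from run to run instead of being fixed in advance.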
Comparing SPRT to other analysis methods
Category | Frequentist | SPRT |
---|---|---|
Test Statistic | P-value | Likelihood Ratio |
Decision Threshold | Alpha | A & B |
Decision Framework | Reject/Fail to Reject the Null | Accept the Null, Accept the Alternative Hypothesis, or Continue |
Allows Peeking | Yes, but with Sequential Testing Penalties | Yes, Unlimited |
Requires Pre-Setup | No, but highly recommended | Yes, per metric |
Allows 1- and 2-Sided tests | Yes, per metric | Yes, per metric |
How to Use SPRT in Statsig
Enabling SPRT: Select SPRT as your analysis method when setting up an AB test in the Statsig console.
Interpreting Results: The experiment Results tab shows the latest likelihood ratio for each metric in your experiment and indicates when a decision boundary has been reached, allowing you to accept the null or alternative hypothesis with confidence.
Computing SPRT Results
Statsig uses an updated version of Hajnal's two-sample t test, as modified by Derek Ho of Atlassian (ref TBD), in our SPRT calculations.
On each day, a likelihood ratio is computed for a comparison between any two groups A and B for a specific metric. The calculation uses the following quantities:
- the PDF of a normal distribution with a given mean and variance, evaluated at a given point
- the observed Z-statistic between the groups
- an effect size derived from the Cohen's d set prior to the experiment for the particular metric being considered
- the number of observed units in each group
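For intuition only, one simple way to combine these ingredients is a plain normal approximation: treat the Z-statistic as normally distributed with unit variance, centered at 0 under the null and at d · sqrt(n_A · n_B / (n_A + n_B)) under the alternative, where d is the pre-set Cohen's d. The sketch below is an assumption of this text, not the Hajnal-based formula Statsig uses, and `z_likelihood_ratio` is a hypothetical name.

```python
import math
from statistics import NormalDist


def z_likelihood_ratio(z: float, cohens_d: float, n_a: int, n_b: int) -> float:
    """Normal-approximation likelihood ratio for an observed Z-statistic.

    Assumption: under the alternative, Z ~ N(theta, 1) with
    theta = cohens_d * sqrt(n_a * n_b / (n_a + n_b)); under the null, Z ~ N(0, 1).
    """
    theta = cohens_d * math.sqrt(n_a * n_b / (n_a + n_b))
    numerator = NormalDist(mu=theta, sigma=1.0).pdf(z)    # density under H1
    denominator = NormalDist(mu=0.0, sigma=1.0).pdf(z)    # density under H0
    # For very extreme z both densities can underflow to 0; the equivalent
    # closed form exp(theta * z - theta**2 / 2) is safer in that regime.
    return numerator / denominator


# Hypothetical inputs: Cohen's d of 0.1, 200 units per group, observed z of 1.0.
print(z_likelihood_ratio(z=1.0, cohens_d=0.1, n_a=200, n_b=200))
```

Under this approximation the ratio grows when the observed Z-statistic is close to what the pre-set effect size predicts and shrinks when it is close to zero.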