p-Value Calculation
What p-values mean in Statsig experiments, how they are computed, and how to interpret them alongside confidence intervals and lift estimates.
In Null Hypothesis Significance Tests, the p-value is the probability of observing an effect at least as extreme as the measured metric delta, under the assumption that the null hypothesis is true. A p-value below the pre-defined Type I error threshold ($\alpha$) serves as evidence of a true effect.
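To make the role of the Type I error threshold concrete, the following sketch simulates A/A tests, where the null hypothesis holds by construction, and counts how often the p-value falls below the threshold. It is an illustration only: the threshold of 0.05, the group size, the number of simulated experiments, and the use of a plain two-sample z-test on normally distributed data are all assumptions made for this example, not a description of Statsig's pipeline.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, variance

ALPHA = 0.05          # assumed Type I error threshold
N_PER_GROUP = 100     # hypothetical number of units per group
N_EXPERIMENTS = 2000  # hypothetical number of simulated A/A tests

rng = random.Random(0)  # fixed seed so the run is repeatable


def two_sided_p(control, treatment):
    """Two-sided p-value of a two-sample z-test on raw observations."""
    se = sqrt(variance(control) / len(control)
              + variance(treatment) / len(treatment))
    z = (fmean(treatment) - fmean(control)) / se
    # 2 * Phi(-|z|) equals 2 * (1 - Phi(|z|)) by symmetry of the normal.
    return 2.0 * NormalDist().cdf(-abs(z))


false_positives = 0
for _ in range(N_EXPERIMENTS):
    # Both groups come from the same distribution: there is no true effect.
    control = [rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
    treatment = [rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
    if two_sided_p(control, treatment) < ALPHA:
        false_positives += 1

false_positive_rate = false_positives / N_EXPERIMENTS
print(f"share of A/A tests with p < {ALPHA}: {false_positive_rate:.3f}")
```

With no true effect, the share of significant results should land close to the chosen threshold, which is what it means for the threshold to control the Type I error rate.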
The methodology for p-value calculation depends on the number of degrees of freedom ($\nu$). A two-sample z-test is appropriate for most experiments. Statsig uses Welch's t-test for smaller experiments with low $\nu$. In both cases, the p-value depends on the metric mean and variance computed for the test and control groups.
Typically, a p-value below the threshold occurs only when the confidence interval doesn't cross 0. However, an exception can occur in the Statsig UI: the p-value of the absolute difference between test and control is statistically significant, but uncertainty in the control causes the relative delta confidence interval to cross zero (using the Delta Method) or to be represented as a point estimate (using Fieller Intervals).
Two-sample tests
Two-sided z-test
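You can compute the z-statistic (a.k.a. z-score) of a two-sample z-test in multiple equivalent formats:
$$Z = \frac{\Delta\bar{X}}{\sqrt{\mathrm{var}(\Delta\bar{X})}} = \frac{\bar{X}_t - \bar{X}_c}{\sqrt{\mathrm{var}(\bar{X}_t) + \mathrm{var}(\bar{X}_c)}} = \frac{\bar{X}_t - \bar{X}_c}{\sqrt{\mathrm{SE}_t^2 + \mathrm{SE}_c^2}}$$
where:
- $Z$ is the observed z-statistic (not the z-critical value)
- $\bar{X}_t$ and $\bar{X}_c$ are the metric means of the treatment and control groups, and $\Delta\bar{X} = \bar{X}_t - \bar{X}_c$ is their absolute delta
- $\mathrm{var}(\Delta\bar{X})$ is the variance of the absolute delta of means
- $\mathrm{var}(\bar{X}_t)$ and $\mathrm{var}(\bar{X}_c)$ are the variances of the sample means of the treatment and control groups (details here)
- $\mathrm{SE}_t$ and $\mathrm{SE}_c$ are the standard errors of the mean of the treatment and control groups (these are the terms you can find in Pulse under the Statistics tab of a metric)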
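The two-sided p-value comes from the standard normal cumulative distribution function $\Phi$:
$$p = 2\,\bigl(1 - \Phi(|Z|)\bigr)$$
where $|Z|$ is the absolute value of the observed z-statistic.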
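As a minimal sketch of the two-sided z-test, the snippet below computes the z-statistic from the group means and their standard errors, then converts it to a two-sided p-value. The function names and the input numbers are hypothetical and chosen only for illustration; the normal CDF comes from Python's standard `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(mean_t, mean_c, se_t, se_c):
    """Observed z-statistic: delta of means divided by its standard error."""
    # The variance of the delta is the sum of the squared standard errors,
    # which assumes the two groups are independent.
    return (mean_t - mean_c) / sqrt(se_t ** 2 + se_c ** 2)


def two_sided_p_value(z):
    """Two-sided p-value from the standard normal CDF."""
    # 2 * Phi(-|z|) equals 2 * (1 - Phi(|z|)) by symmetry and avoids
    # subtracting two nearly equal numbers when |z| is large.
    return 2.0 * NormalDist().cdf(-abs(z))


# Hypothetical metric summaries for the control and treatment groups.
mean_c, se_c = 10.0, 0.15
mean_t, se_t = 10.5, 0.20

z = z_statistic(mean_t, mean_c, se_t, se_c)
p = two_sided_p_value(z)
print(f"z = {z:.3f}, two-sided p = {p:.4f}")
```

Because the p-value depends only on the absolute value of the statistic, swapping the treatment and control inputs flips the sign of `z` but leaves `p` unchanged.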
Welch's t-test
For smaller sample sizes, Welch's t-test is preferred because it produces lower false positive rates when the groups have unequal sample sizes and variances. In Pulse, Statsig automatically applies Welch's t-test when the degrees of freedom $\nu$ are low.
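Statsig computes the t-statistic (also known as t-score) identically to the two-sample z-statistic above. Statsig computes the degrees of freedom using:
$$\nu = \frac{\bigl(\mathrm{var}(\bar{X}_t) + \mathrm{var}(\bar{X}_c)\bigr)^2}{\dfrac{\mathrm{var}(\bar{X}_t)^2}{n_t - 1} + \dfrac{\mathrm{var}(\bar{X}_c)^2}{n_c - 1}}$$
where $\mathrm{var}(\bar{X}_t)$ and $\mathrm{var}(\bar{X}_c)$ are the variances of the sample means of the treatment and control groups, and $n_t$ and $n_c$ are the numbers of units in each group.
Statsig then obtains the p-value from the t-distribution with $\nu$ degrees of freedom.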
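The Welch procedure can be sketched in a few lines. This example assumes the standard Welch-Satterthwaite approximation for the degrees of freedom and, to stay within the Python standard library, integrates the t-distribution density numerically with Simpson's rule. A production implementation would more likely call a statistics library such as SciPy (`scipy.stats.t.sf`). All function names and input values are hypothetical.

```python
from math import exp, lgamma, log, log1p, pi, sqrt


def welch_df(var_mean_t, n_t, var_mean_c, n_c):
    """Welch-Satterthwaite degrees of freedom from variances of the means."""
    numerator = (var_mean_t + var_mean_c) ** 2
    denominator = var_mean_t ** 2 / (n_t - 1) + var_mean_c ** 2 / (n_c - 1)
    return numerator / denominator


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log1p(x * x / df))


def two_sided_t_p_value(t_stat, df, steps=2000):
    """Two-sided p-value: 1 - 2 * P(0 <= T <= |t|), via Simpson's rule."""
    upper = abs(t_stat)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = total * h / 3
    return max(0.0, 1.0 - 2.0 * central_mass)


# Hypothetical small experiment with unequal group sizes and variances.
mean_t, var_mean_t, n_t = 10.5, 0.04, 30
mean_c, var_mean_c, n_c = 10.0, 0.0225, 25

t_stat = (mean_t - mean_c) / sqrt(var_mean_t + var_mean_c)
df = welch_df(var_mean_t, n_t, var_mean_c, n_c)
p_value = two_sided_t_p_value(t_stat, df)
print(f"t = {t_stat:.3f}, df = {df:.1f}, two-sided p = {p_value:.4f}")
```

For the same statistic, the t-distribution has heavier tails than the standard normal, so the resulting p-value is larger than the z-test would give. That is what makes the test more conservative at small sample sizes. Note that subtracting from 1 loses precision for very small p-values, which is another reason to prefer a library routine in practice.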
One-sided z-test
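A one-sided z-test computes the z-statistic in the same way as the two-sided test above.
The one-sided p-value also comes from the standard normal cumulative distribution function $\Phi$, but it depends on the direction of the test:
$$p = \begin{cases} 1 - \Phi(Z) & \text{when testing for an increase (treatment greater than control)} \\ \Phi(Z) & \text{when testing for a decrease (treatment less than control)} \end{cases}$$
where:
- $Z$ is computed as shown in the two-sided test above. This uses the signed z-statistic, not the absolute value used in the two-sided p-value.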
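A short sketch of the one-sided calculation follows. It assumes the z-statistic is defined as treatment minus control, so a positive value indicates an increase. The `direction` parameter and its values are hypothetical names used only for this example.

```python
from statistics import NormalDist


def one_sided_p_value(z, direction="greater"):
    """One-sided p-value from the signed z-statistic.

    direction="greater" tests for an increase (treatment > control);
    direction="less" tests for a decrease (treatment < control).
    """
    phi = NormalDist().cdf
    if direction == "greater":
        return phi(-z)  # equals 1 - Phi(z) by symmetry
    if direction == "less":
        return phi(z)
    raise ValueError("direction must be 'greater' or 'less'")


z = 2.0  # hypothetical signed z-statistic
p_increase = one_sided_p_value(z, "greater")
p_decrease = one_sided_p_value(z, "less")
print(f"p (increase) = {p_increase:.4f}, p (decrease) = {p_decrease:.4f}")
```

The two directional p-values always sum to 1, and when the observed effect points in the tested direction, the one-sided p-value is half of the two-sided p-value for the same statistic.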