p-Value Calculation

What p-values mean in Statsig experiments, how they are computed, and how to interpret them alongside confidence intervals and lift estimates.

In Null Hypothesis Significance Tests, the p-value is the probability of observing an effect at least as extreme as the measured metric delta, assuming that the null hypothesis is true. A p-value below the pre-defined Type I error threshold ($\alpha$) is treated as evidence of a true effect.

The methodology for p-value calculation depends on the number of degrees of freedom ($\nu$). A two-sample z-test is appropriate for most experiments. Statsig uses Welch's t-test for smaller experiments with $\nu < 100$. In both cases, the p-value depends on the metric mean and variance computed for the test and control groups.

Typically, a p-value below the threshold $\alpha$ occurs only when the confidence interval doesn't cross 0. However, an exception can occur in the Statsig UI. The p-value of the absolute difference between test and control can be statistically significant, while uncertainty in the control causes the relative delta confidence interval to cross zero (using the Delta Method) or appear as a point estimate (using Fieller Intervals).
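The exception described above can be illustrated with a small numeric sketch. It assumes a first-order delta-method variance for the ratio of two independent sample means; this section does not give the exact formula Statsig uses, so treat the function names, the variance expression, and the input numbers as illustrative assumptions only.

```python
import math

Z_CRIT_95 = 1.959963984540054  # two-sided critical value for alpha = 0.05


def absolute_p_value(mean_t, mean_c, var_mean_t, var_mean_c):
    """Two-sided z-test p-value for the absolute difference of means."""
    z = (mean_t - mean_c) / math.sqrt(var_mean_t + var_mean_c)
    return math.erfc(abs(z) / math.sqrt(2.0))


def relative_delta_ci(mean_t, mean_c, var_mean_t, var_mean_c, z_crit=Z_CRIT_95):
    """Confidence interval for (mean_t - mean_c) / mean_c.

    Assumption: first-order delta method with independent groups,
    var(ratio) ~= var_t / mean_c^2 + mean_t^2 * var_c / mean_c^4.
    """
    rel = (mean_t - mean_c) / mean_c
    var_rel = var_mean_t / mean_c**2 + mean_t**2 * var_mean_c / mean_c**4
    half_width = z_crit * math.sqrt(var_rel)
    return rel - half_width, rel + half_width


# Hypothetical inputs: a noisy control mean (standard error 0.5 on a mean of 1.0).
p = absolute_p_value(10.0, 1.0, 4.0, 0.25)
low, high = relative_delta_ci(10.0, 1.0, 4.0, 0.25)
print(f"absolute p-value: {p:.2e}")
print(f"relative delta CI: [{low:.2f}, {high:.2f}]")
```

With these made-up numbers the absolute difference is significant at $\alpha = 0.05$, yet the relative interval spans zero because the uncertainty in the control mean is amplified when it appears in the denominator.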

Two-sample tests

Two-sided z-test

You can compute the z-statistic (also known as the z-score) of a two-sample z-test in multiple equivalent formats:

\begin{split} Z &= \frac{\overline X_t - \overline X_c}{\sqrt{var(\overline X_t)+ var(\overline X_c)}} \\ &= \frac{\overline X_t - \overline X_c}{\sqrt{var(\Delta \overline{X})}} \\ &= \frac{\overline X_t - \overline X_c}{\sqrt{\sigma_{\overline{X}_t}^2 + \sigma_{\overline{X}_c}^2}} \end{split}

where:

$Z$ is the observed z-statistic (not the z-critical value $Z_{\alpha/2}$)
$var(\Delta \overline{X})$ is the variance of the absolute delta of means
$var(\overline{X}_i)$ is the variance of the sample mean for either the control or treatment group
$\sigma_{\overline{X}_i}$ is the standard error of the mean for either the control or treatment group, so that $\sigma_{\overline{X}_i}^2 = var(\overline{X}_i)$ (these are the values shown in Pulse under the Statistics tab of a metric)

The two-sided p-value comes from the standard normal cumulative distribution function:

p-value = 2 \cdot \frac{1}{\sqrt{2\pi}} \int \limits _{-\infty}^{-|Z|}{e^{-t^2/2}dt}
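As a minimal sketch of the two formulas above, the following uses only the Python standard library. The function names and the sample inputs are hypothetical; the normal CDF integral is evaluated through the identity $2\,\Phi(-|Z|) = \operatorname{erfc}(|Z|/\sqrt{2})$.

```python
import math


def z_statistic(mean_t, mean_c, var_mean_t, var_mean_c):
    """Observed z-statistic from group means and variances of the sample means."""
    return (mean_t - mean_c) / math.sqrt(var_mean_t + var_mean_c)


def two_sided_p_value(z):
    """Two-sided p-value: 2 * Phi(-|z|), written with the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical inputs: the variance of each sample mean is s^2 / N for that group.
z = z_statistic(mean_t=10.4, mean_c=10.0, var_mean_t=0.02, var_mean_c=0.02)
print(f"z = {z:.3f}, p-value = {two_sided_p_value(z):.4f}")
```

Using `math.erfc` avoids subtracting a CDF value close to 1, which keeps small p-values numerically stable.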

Welch's t-test

For smaller sample sizes, Welch's t-test is preferable because it produces lower false positive rates when the two groups have unequal sample sizes and variances. In Pulse, Statsig automatically applies Welch's t-test when the degrees of freedom $\nu < 100$.

Statsig computes the t-statistic (also known as t-score) identically to the two-sample z-statistic above. Statsig computes the degrees of freedom $\nu$ using:

\nu = \frac{\left(var(\overline X_t) + var(\overline X_c)\right)^2}{\frac{var(\overline X_t)^2}{N_t - 1}+\frac{var(\overline X_c)^2}{N_c - 1}} = \frac{var(\Delta\overline{X})^2}{\frac{var(\overline X_t)^2}{N_t - 1}+\frac{var(\overline X_c)^2}{N_c - 1}}

Statsig then obtains the p-value from the t-distribution with $\nu$ degrees of freedom.
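The degrees-of-freedom formula and the t-distribution lookup can be sketched as follows. This is a standard-library approximation, not Statsig's implementation: the function names are hypothetical, the sketch assumes a two-sided test, and the t-distribution tail is obtained by Simpson's-rule integration of the density. In practice a statistics library (for example `scipy.stats.t.sf`) would normally replace the hand-rolled integration.

```python
import math


def welch_degrees_of_freedom(var_mean_t, var_mean_c, n_t, n_c):
    """Welch-Satterthwaite degrees of freedom from variances of the sample means."""
    numerator = (var_mean_t + var_mean_c) ** 2
    denominator = var_mean_t**2 / (n_t - 1) + var_mean_c**2 / (n_c - 1)
    return numerator / denominator


def t_pdf(x, nu):
    """Density of Student's t-distribution with nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(x * x / nu))


def t_two_sided_p_value(t_stat, nu, steps=2000):
    """Two-sided p-value: 1 minus the probability mass in [-|t|, |t|].

    The mass is integrated numerically with Simpson's rule (steps must be even).
    Precision degrades for extremely small p-values because of the subtraction.
    """
    upper = abs(t_stat)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, nu) + t_pdf(upper, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    half_central_mass = total * h / 3  # integral of the density over [0, |t|]
    return max(0.0, 1.0 - 2.0 * half_central_mass)


# Hypothetical small experiment: 12 treatment units and 10 control units.
var_t, var_c = 0.04, 0.05
nu = welch_degrees_of_freedom(var_t, var_c, n_t=12, n_c=10)
t_stat = (10.5 - 10.0) / math.sqrt(var_t + var_c)
print(f"nu = {nu:.2f}, t = {t_stat:.3f}, p-value = {t_two_sided_p_value(t_stat, nu):.4f}")
```

For the same statistic, the t-distribution gives a larger p-value than the normal distribution, and the gap shrinks as $\nu$ grows, which is consistent with switching to the z-test for larger experiments.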

One-sided z-test

The procedure for a one-sided z-test computes the z-statistic $Z$ in the same way as the two-sided test above.

The one-sided p-value comes from the standard normal cumulative distribution function, but with the following differences:

p-value = \begin{cases} 1 - \frac{1}{\sqrt{2\pi}} \int \limits _{-\infty}^{Z}{e^{-t^2/2}dt} &\text{if right-hand test}\\ \frac{1}{\sqrt{2\pi}} \int \limits _{-\infty}^{Z}{e^{-t^2/2}dt} &\text{if left-hand test} \end{cases}

Here $Z$ is the signed z-statistic, computed as in the two-sided test above, rather than the absolute value $|Z|$ used in the two-sided p-value.
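The piecewise definition above can be sketched with the standard library as follows. The function name and the `direction` argument are hypothetical; the sketch uses the identities $1 - \Phi(Z) = \tfrac{1}{2}\operatorname{erfc}(Z/\sqrt{2})$ and $\Phi(Z) = \tfrac{1}{2}\operatorname{erfc}(-Z/\sqrt{2})$.

```python
import math


def one_sided_p_value(z, direction):
    """One-sided p-value from the signed z-statistic.

    direction: "right" tests for a positive effect, "left" for a negative one.
    """
    if direction == "right":
        return 0.5 * math.erfc(z / math.sqrt(2.0))   # 1 - Phi(z)
    if direction == "left":
        return 0.5 * math.erfc(-z / math.sqrt(2.0))  # Phi(z)
    raise ValueError("direction must be 'right' or 'left'")


z = 1.2  # hypothetical signed z-statistic
print(f"right-hand: {one_sided_p_value(z, 'right'):.4f}")
print(f"left-hand:  {one_sided_p_value(z, 'left'):.4f}")
```

The two results always sum to 1, so a strongly positive $Z$ gives a small right-hand p-value and a left-hand p-value close to 1.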
