p-Value Calculation

What p-values mean in Statsig experiments, how they are computed, and how to interpret them alongside confidence intervals and lift estimates.

In Null Hypothesis Significance Tests, the p-value is the probability of observing an effect at least as extreme as the measured metric delta, under the assumption that the null hypothesis is true. A p-value below the pre-defined Type I error threshold ($\alpha$) serves as evidence of a true effect.

The methodology for p-value calculation depends on the number of degrees of freedom ($\nu$). A two-sample z-test is appropriate for most experiments; Statsig uses Welch's t-test for smaller experiments with $\nu < 100$. In both cases, the p-value depends on the metric mean and variance computed for the test and control groups.

Typically, a p-value falls below the threshold $\alpha$ only when the confidence interval doesn't cross 0. One exception can appear in the Statsig UI: the p-value for the absolute difference between test and control is statistically significant, but uncertainty in the control mean causes the relative delta confidence interval to cross zero (when using the Delta Method) or to be shown as a point estimate (when using Fieller intervals).
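The exception described above can be illustrated with a small sketch. It uses the textbook first-order delta-method variance for a relative delta of two independent means, which may differ in detail from what Statsig computes internally. The function names and the input numbers are hypothetical and chosen only to make the effect visible.

```python
import math

# Two-sided critical value for alpha = 0.05 (assumed threshold for this sketch).
Z_CRIT = 1.959963984540054


def absolute_delta_ci(mean_t, mean_c, var_mean_t, var_mean_c):
    """Confidence interval for the absolute delta mean_t - mean_c."""
    delta = mean_t - mean_c
    half_width = Z_CRIT * math.sqrt(var_mean_t + var_mean_c)
    return delta - half_width, delta + half_width


def relative_delta_ci(mean_t, mean_c, var_mean_t, var_mean_c):
    """First-order delta-method interval for (mean_t - mean_c) / mean_c.

    Assumes the two group means are independent (an assumption of this sketch).
    """
    rel = (mean_t - mean_c) / mean_c
    # Partial derivatives: d/d(mean_t) = 1/mean_c, d/d(mean_c) = -mean_t/mean_c**2
    var_rel = var_mean_t / mean_c**2 + (mean_t**2 / mean_c**4) * var_mean_c
    half_width = Z_CRIT * math.sqrt(var_rel)
    return rel - half_width, rel + half_width


# Hypothetical inputs: a noisy control mean that is small relative to treatment.
abs_ci = absolute_delta_ci(3.0, 1.0, 0.25, 0.2025)
rel_ci = relative_delta_ci(3.0, 1.0, 0.25, 0.2025)
print("absolute delta CI:", abs_ci)  # excludes zero
print("relative delta CI:", rel_ci)  # crosses zero
```

With these inputs the absolute interval stays above zero while the relative interval spans it, because the control variance is scaled up by the squared ratio of the means.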

Two-sample tests

Two-sided z-test

You can compute the z-statistic (a.k.a. z-score) of a two-sample z-test in multiple equivalent formats:

$$
\begin{split}
Z &= \frac{\overline X_t - \overline X_c}{\sqrt{var(\overline X_t) + var(\overline X_c)}} \\
&= \frac{\overline X_t - \overline X_c}{\sqrt{var(\Delta \overline{X})}} \\
&= \frac{\overline X_t - \overline X_c}{\sqrt{\sigma_{\overline{X}_t}^2 + \sigma_{\overline{X}_c}^2}}
\end{split}
$$

where:

  • $Z$ is the observed z-statistic (not the z-critical value $Z_{\alpha/2}$)
  • $var(\Delta \overline{X})$ is the variance of the absolute delta of means
  • $var(\overline{X}_i)$ is the variance of the sample mean of either the control or treatment group (details here)
  • $\sigma_{\overline{X}_i}$ is the standard error of the mean of either the control or treatment group (these are the terms you can find in Pulse under the Statistics tab of a metric)

The two-sided p-value comes from the standard normal cumulative distribution function:

$$
p\text{-value} = 2 \cdot \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{-|Z|} e^{-t^2/2}\,dt
$$
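As a minimal sketch of the two formulas above, the following Python computes the z-statistic from the group means and their standard errors, then evaluates the two-sided p-value through the error function. The function names and input values are hypothetical and not part of any Statsig API.

```python
import math


def z_statistic(mean_t, mean_c, se_t, se_c):
    """Observed z-statistic from group means and standard errors of the mean."""
    return (mean_t - mean_c) / math.sqrt(se_t**2 + se_c**2)


def standard_normal_cdf(x):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def two_sided_p_value(z):
    """Two-sided p-value: twice the lower-tail probability at -|Z|."""
    return 2.0 * standard_normal_cdf(-abs(z))


# Hypothetical inputs for illustration only.
z = z_statistic(mean_t=10.5, mean_c=10.0, se_t=0.15, se_c=0.2)
p = two_sided_p_value(z)
print(f"Z = {z:.3f}, p-value = {p:.4f}")  # Z = 2.000, p-value = 0.0455
```

Because the integral runs to $-|Z|$, the sign of the delta does not affect the two-sided result.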

Welch's t-test

For smaller sample sizes, Welch's t-test is preferred because it produces lower false positive rates when the groups have unequal sample sizes and variances. In Pulse, Statsig automatically applies Welch's t-test when the degrees of freedom $\nu < 100$.

Statsig computes the t-statistic (also known as t-score) identically to the two-sample z-statistic above. Statsig computes the degrees of freedom $\nu$ using:

$$
\nu = \frac{\left(var(\overline X_t) + var(\overline X_c)\right)^2}{\frac{var(\overline X_t)^2}{N_t - 1} + \frac{var(\overline X_c)^2}{N_c - 1}} = \frac{var(\Delta\overline{X})^2}{\frac{var(\overline X_t)^2}{N_t - 1} + \frac{var(\overline X_c)^2}{N_c - 1}}
$$

Statsig then obtains the p-value from the t-distribution with $\nu$ degrees of freedom.
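The Welch procedure can be sketched with the standard library alone. The degrees-of-freedom function follows the formula above. The two-sided p-value is obtained here by numerically integrating the Student's t density with Simpson's rule, which is a choice made for this sketch to avoid third-party dependencies and is not a description of how Statsig evaluates the t-distribution. All names and inputs are hypothetical.

```python
import math


def welch_degrees_of_freedom(var_mean_t, var_mean_c, n_t, n_c):
    """Welch-Satterthwaite degrees of freedom from variances of the sample means."""
    numerator = (var_mean_t + var_mean_c) ** 2
    denominator = var_mean_t**2 / (n_t - 1) + var_mean_c**2 / (n_c - 1)
    return numerator / denominator


def t_pdf(x, nu):
    """Density of Student's t-distribution with nu degrees of freedom."""
    log_norm = (
        math.lgamma((nu + 1) / 2)
        - math.lgamma(nu / 2)
        - 0.5 * math.log(nu * math.pi)
    )
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(x * x / nu))


def two_sided_t_p_value(t_stat, nu, steps=2000):
    """Two-sided p-value: 1 minus the probability mass in [-|t|, |t|]."""
    upper = abs(t_stat)
    if upper == 0.0:
        return 1.0
    if steps % 2:  # composite Simpson's rule needs an even number of intervals
        steps += 1
    h = upper / steps
    total = t_pdf(0.0, nu) + t_pdf(upper, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    half_mass = total * h / 3  # integral of the density from 0 to |t|
    return max(0.0, 1.0 - 2.0 * half_mass)


# Hypothetical small experiment.
var_t, var_c = 0.0225, 0.04  # variances of the sample means
nu = welch_degrees_of_freedom(var_t, var_c, n_t=12, n_c=15)
t_stat = (10.5 - 10.0) / math.sqrt(var_t + var_c)
use_welch = nu < 100  # threshold stated in the text above
p_welch = two_sided_t_p_value(t_stat, nu)
print(f"nu = {nu:.2f}, use Welch: {use_welch}, p-value = {p_welch:.4f}")
```

For the same statistic, the t-distribution gives a larger p-value than the normal distribution, and the gap shrinks as $\nu$ grows.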

One-sided z-test

The procedure for a one-sided z-test computes the z-statistic $Z$ in the same way as the two-sided test above.

The one-sided p-value comes from the standard normal cumulative distribution function, but with the following differences:

$$
p\text{-value} = \begin{cases}
1 - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{Z} e^{-t^2/2}\,dt & \text{if right-hand test} \\
\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{Z} e^{-t^2/2}\,dt & \text{if left-hand test}
\end{cases}
$$

where:

  • $Z$ is computed as shown in the two-sided test above. This uses the signed z-statistic, not the absolute value used in the two-sided p-value.
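A minimal sketch of the one-sided calculation follows. The `direction` argument and the function names are hypothetical conveniences for this example. Note that the signed statistic is passed through unchanged.

```python
import math


def standard_normal_cdf(x):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def one_sided_p_value(z, direction):
    """One-sided p-value from the signed z-statistic.

    direction: "right" for a right-hand test, "left" for a left-hand test.
    """
    if direction == "right":
        return 1.0 - standard_normal_cdf(z)
    if direction == "left":
        return standard_normal_cdf(z)
    raise ValueError("direction must be 'right' or 'left'")


# Hypothetical signed z-statistic.
z = 2.0
print(f"right-hand test: {one_sided_p_value(z, 'right'):.4f}")
print(f"left-hand test:  {one_sided_p_value(z, 'left'):.4f}")
```

For any given $Z$, the two directions give p-values that sum to 1, so a positive delta that is significant in a right-hand test is far from significant in a left-hand test.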