p-Value Calculation
What p-values mean in Statsig Warehouse Native experiments, how they are computed, and how to interpret them alongside intervals and lift estimates.
In Null Hypothesis Significance Tests, the p-value is the probability of observing an effect at least as extreme as the measured metric delta, assuming the null hypothesis is true. In practice, a p-value below the pre-defined Type I error threshold ($\alpha$) is treated as evidence of a true effect.
The method used to calculate the p-value depends on the number of degrees of freedom ($\nu$). A two-sample z-test is appropriate for most experiments; Welch's t-test is used for smaller experiments with $\nu < 100$. In both cases, the p-value depends on the metric mean and variance computed for the test and control groups.
Typically, a p-value falls below the threshold $\alpha$ only when the confidence interval does not cross 0. However, this is not always the case in the Statsig UI. When the p-value of the absolute difference between test and control is statistically significant, the relative delta confidence interval may still cross zero (when using the Delta Method) or appear as a point estimate (when using Fieller intervals).
Two-Sample Tests
Two-Sided z-Test
The z-statistic (also called the z-score) of a two-sample z-test can be written in several equivalent forms:
$$
\begin{split}
Z &= \frac{\overline X_t - \overline X_c}{\sqrt{var(\overline X_t) + var(\overline X_c)}} \\
&= \frac{\overline X_t - \overline X_c}{\sqrt{var(\Delta \overline{X})}} \\
&= \frac{\overline X_t - \overline X_c}{\sqrt{\sigma_{\overline{X}_t}^2 + \sigma_{\overline{X}_c}^2}}
\end{split}
$$
where:
- $Z$ is the observed z-statistic (not the z-critical value $Z_{\alpha/2}$)
- $var(\Delta \overline{X})$ is the variance of the absolute delta of means
- $var(\overline{X}_i)$ is the variance of the sample mean for the control or treatment group
- $\sigma_{\overline{X}_i}$ is the standard error of the mean for the control or treatment group (these are the values shown in Pulse under the Statistics tab of a metric)
The two-sided p-value is obtained from the standard normal cumulative distribution function:
$$
\text{p-value} = 2 \cdot \frac{1}{\sqrt{2\pi}} \int \limits _{-\infty}^{-|Z|}{e^{-t^2/2}dt}
$$
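The formulas above can be sketched in a few lines of Python. This is an illustrative sketch, not Statsig's implementation: the function name `two_sided_z_test` and the input values are hypothetical, and the inputs are assumed to be the variances of the sample means (squared standard errors) rather than raw per-unit variances. It relies on the identity $2\Phi(-|Z|) = \operatorname{erfc}(|Z|/\sqrt{2})$ for the standard normal CDF $\Phi$.

```python
import math

def two_sided_z_test(mean_t, mean_c, var_mean_t, var_mean_c):
    """Return (Z, two-sided p-value) for a two-sample z-test.

    var_mean_t and var_mean_c are variances of the sample means
    (squared standard errors), not raw per-unit variances.
    """
    z = (mean_t - mean_c) / math.sqrt(var_mean_t + var_mean_c)
    # 2 * Phi(-|Z|) is equal to erfc(|Z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical inputs chosen only for illustration.
z, p = two_sided_z_test(mean_t=10.5, mean_c=10.0, var_mean_t=0.02, var_mean_c=0.02)
print(f"Z = {z:.3f}, p-value = {p:.4f}")
```

With these made-up inputs, the delta of 0.5 and a combined standard error of 0.2 give $Z = 2.5$ and a p-value of roughly 0.012.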
Welch's t-test
For smaller sample sizes, Welch's t-test is preferred because it produces lower false positive rates when group sizes and variances are unequal. In Pulse, Statsig automatically applies Welch's t-test when the degrees of freedom $\nu < 100$.
The t-statistic (also called the t-score) is computed identically to the two-sample z-statistic above. The degrees of freedom $\nu$ are computed using:
$$
\nu = \frac{\left(var(\overline X_t) + var(\overline X_c)\right)^2}{\frac{var(\overline X_t)^2}{N_t - 1}+\frac{var(\overline X_c)^2}{N_c - 1}}
= \frac{var(\Delta\overline{X})^2}{\frac{var(\overline X_t)^2}{N_t - 1}+\frac{var(\overline X_c)^2}{N_c - 1}}
$$
Here $N_t$ and $N_c$ are the sample sizes of the test and control groups. The p-value is then obtained from the t-distribution with $\nu$ degrees of freedom.
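As a rough illustration of the two steps above, the sketch below computes $\nu$ from the variances of the sample means and then a two-sided p-value from the t-distribution. The function names, the Simpson's-rule integration, and the example inputs are assumptions made for this sketch so that it runs on the standard library alone; it is not how Statsig computes the value. If SciPy is available, `scipy.stats.t.sf` is the more usual way to get the tail probability.

```python
import math

def welch_degrees_of_freedom(var_mean_t, var_mean_c, n_t, n_c):
    """Degrees of freedom from variances of the sample means and group sizes."""
    numerator = (var_mean_t + var_mean_c) ** 2
    denominator = var_mean_t ** 2 / (n_t - 1) + var_mean_c ** 2 / (n_c - 1)
    return numerator / denominator

def t_two_sided_p_value(t_stat, nu, steps=2000):
    """Two-sided p-value, integrating the t density with Simpson's rule.

    steps must be even. This is a numerical approximation for illustration.
    """
    upper = abs(t_stat)
    if upper == 0.0:
        return 1.0
    # Log of the normalizing constant of the Student t density.
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))

    def pdf(x):
        return math.exp(log_norm - (nu + 1) / 2 * math.log1p(x * x / nu))

    h = upper / steps
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_mass = total * h / 3  # probability that 0 <= T <= |t|
    return max(0.0, 1.0 - 2.0 * central_mass)

# Hypothetical inputs chosen only for illustration.
nu = welch_degrees_of_freedom(var_mean_t=0.04, var_mean_c=0.01, n_t=11, n_c=21)
t_stat = (10.5 - 10.0) / math.sqrt(0.04 + 0.01)
print(f"nu = {nu:.2f}, t = {t_stat:.3f}, p-value = {t_two_sided_p_value(t_stat, nu):.4f}")
```

Because the group with the larger variance is also the smaller group here, $\nu$ comes out near 15, well below the combined sample size.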
One-Sided z-Test
The one-sided z-test computes the z-statistic $Z$ in the same way as the two-sided test. The one-sided p-value differs as follows:
$$
\text{p-value} = \begin{cases}
1 - \frac{1}{\sqrt{2\pi}} \int \limits _{-\infty}^{Z}{e^{-t^2/2}dt} &\text{if right-hand test} \\
\frac{1}{\sqrt{2\pi}} \int \limits _{-\infty}^{Z}{e^{-t^2/2}dt} &\text{if left-hand test}
\end{cases}
$$
where:
- $Z$ is the signed z-statistic, computed as in the two-sided test. The one-sided p-value uses $Z$ itself rather than the absolute value $|Z|$ used in the two-sided p-value.
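The two cases can be sketched as follows. The function name `one_sided_p_value` and its `direction` argument are hypothetical, introduced only for this example; the code assumes $Z$ has already been computed as in the two-sided test and uses the identity $\Phi(Z) = \tfrac{1}{2}\operatorname{erfc}(-Z/\sqrt{2})$.

```python
import math

def one_sided_p_value(z, direction="right"):
    """One-sided p-value from a signed z-statistic.

    direction="right" tests for a positive effect (upper tail),
    direction="left" tests for a negative effect (lower tail).
    """
    if direction == "right":
        # 1 - Phi(Z), written with erfc to keep precision for large Z.
        return 0.5 * math.erfc(z / math.sqrt(2))
    if direction == "left":
        # Phi(Z)
        return 0.5 * math.erfc(-z / math.sqrt(2))
    raise ValueError("direction must be 'right' or 'left'")

# The same Z gives complementary p-values for the two directions.
p_right = one_sided_p_value(2.5, "right")
p_left = one_sided_p_value(2.5, "left")
print(f"right-hand: {p_right:.4f}, left-hand: {p_left:.4f}")
```

For a positive $Z$, the right-hand p-value is half the two-sided p-value, while the left-hand p-value is close to 1, which is why the sign of $Z$ matters here.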