Experiment Quality Score

Learn how to assess and improve the quality and trustworthiness of your experiments with Statsig's quality scoring system.

How experiment quality score works

The Experiment Quality Score is a metric that provides a quick measure of the quality and trustworthiness of an experiment configured in Statsig. The score helps experimenters and their peers quickly identify potential issues in experiment setup, execution, and data collection, enabling more confident decision-making. Measuring this score across all experiments helps teams discover systematic issues and identify opportunities to improve their experimentation program over time.

Configure experiment quality score

To enable the Experiment Quality Score, go to Settings > Experimentation > Experiment Quality Score in the project settings.

Statsig evaluates experiments against a list of pre-defined assessment criteria. You can customize the weight of each criterion based on your organization's needs, though Statsig provides default values.

Experiment quality score configuration interface

Advanced configuration

For organizations with more complex requirements, you might need additional checks, different requirements per product team, or thresholds that differ from the defaults. For example, hypotheses might need to be at least 200 characters and contain a link to an external planning doc.

To manage these requirements, use the Statsig console API. Send a POST or PATCH request to the console/v1/experiments endpoint to set individual scores on any experiment. By targeting the built-in criteria by name, you can override their weights (usually to 0), so the list contains only the custom set you need.

For example, running a PATCH on an experiment with this payload:

plaintext

{
    "manualQualityScores": [
        {
            "criteriaName": "HYPOTHESIS_LENGTH",
            "criteriaDescription": "Check passed",
            "status": "PASSED",
            "score": 0,
            "weight": 0
        },
        {
            "criteriaName": "MyCompany's Hypothesis Check",
            "criteriaDescription": "Has Internal URL and > 200 Chars",
            "status": "PASSED",
            "score": 100,
            "weight": 100
        },
        {
            "criteriaName": "Naming",
            "criteriaDescription": "Experiment prefixed with team name",
            "status": "FAILED",
            "score": 0,
            "weight": 100
        }
    ]
}

This payload:

- Drops the original HYPOTHESIS_LENGTH check
- Keeps the other original checks, with their weights
- Adds a new check, MyCompany's Hypothesis Check, for custom logic on the hypothesis
- Adds a new check, Naming, for custom logic on the name

Statsig normalizes the other weights. Assuming the default weights sum to 100 and the original HYPOTHESIS_LENGTH check had a weight of 10, the remaining original checks carry a weight of 90 and the two custom checks add 200, for a total weight of 290. If all non-custom checks pass, the score is (90 + 100 + 0)/290 = 190/290, or about 66%.
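The arithmetic in this example can be checked with a short sketch. It assumes, as the example implies, that the default weights sum to 100 and that a passing check earns its full weight while a failing check earns nothing. The variable names are illustrative only.

```python
# Assumption: default weights sum to 100 and HYPOTHESIS_LENGTH carried 10 of them.
remaining_default_weight = 100 - 10  # original checks still counted, all passing

# (earned, weight) for each contribution to the score
contributions = [
    (remaining_default_weight, remaining_default_weight),  # remaining defaults, all PASSED
    (100, 100),  # MyCompany's Hypothesis Check: PASSED
    (0, 100),    # Naming: FAILED
]

earned = sum(e for e, _ in contributions)
total = sum(w for _, w in contributions)
score_pct = 100 * earned / total

print(f"{earned}/{total} = {score_pct:.1f}%")  # 190/290 = 65.5%
```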

The general flow for using this approach:

- Use the Console API's experiments/get to pull all experiments
- For each experiment:
  - Run custom logic
  - Patch results
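The flow above could be sketched roughly as follows. This is a minimal illustration, not Statsig's client: the two API calls are passed in as plain callables so that no request details are invented here, and the experiment field names `id` and `hypothesis`, the URL pattern, and the helper names are all assumptions made for the example.

```python
import re
from typing import Callable, Iterable

MIN_HYPOTHESIS_CHARS = 200                 # "at least 200 characters" from the example
URL_PATTERN = re.compile(r"https?://\S+")  # assumption: any http(s) link counts


def hypothesis_check(hypothesis: str) -> dict:
    """Build one manualQualityScores entry for the custom hypothesis rule."""
    passed = (len(hypothesis) >= MIN_HYPOTHESIS_CHARS
              and URL_PATTERN.search(hypothesis) is not None)
    return {
        "criteriaName": "MyCompany's Hypothesis Check",
        "criteriaDescription": "Has a URL and at least 200 characters",
        "status": "PASSED" if passed else "FAILED",
        "score": 100 if passed else 0,
        "weight": 100,
    }


def build_payload(experiment: dict) -> dict:
    """Zero out the built-in length check and attach the custom result."""
    hypothesis = experiment.get("hypothesis") or ""  # assumed field name
    return {
        "manualQualityScores": [
            {
                "criteriaName": "HYPOTHESIS_LENGTH",
                "criteriaDescription": "Replaced by custom check",
                "status": "PASSED",
                "score": 0,
                "weight": 0,
            },
            hypothesis_check(hypothesis),
        ]
    }


def sync_quality_scores(
    list_experiments: Callable[[], Iterable[dict]],
    patch_experiment: Callable[[str, dict], None],
) -> int:
    """Run the custom logic on every experiment and patch the result."""
    patched = 0
    for experiment in list_experiments():
        # "id" is an assumed field name for the experiment identifier.
        patch_experiment(experiment["id"], build_payload(experiment))
        patched += 1
    return patched


if __name__ == "__main__":
    # In-memory stand-ins for the two Console API calls.
    fake_experiments = [{"id": "exp_1", "hypothesis": "Too short"}]
    sent = {}
    sync_quality_scores(lambda: fake_experiments,
                        lambda exp_id, body: sent.update({exp_id: body}))
    print(sent["exp_1"]["manualQualityScores"][1]["status"])  # FAILED
```

In practice, `list_experiments` and `patch_experiment` would wrap authenticated requests to the console API. Keeping them as parameters makes the custom logic easy to test without network access.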

How scores are calculated

Statsig skips checks in an unready state during evaluation and renormalizes the other weights to 100%. For example, if the experiment hasn't started, the Balanced Exposures component is in an unready state, and Statsig ignores that component.

Statsig omits checks with a weight of 0 from the card entirely.
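As an illustration of the skipping and renormalizing described above, here is a minimal sketch. It is not Statsig's implementation: the `Check` type, the `"UNREADY"` status label, and the check names and weights in the example are hypothetical, and it assumes a passing check earns its full weight while a failing check earns nothing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Check:
    name: str
    weight: float
    status: str  # "PASSED", "FAILED", or "UNREADY" (hypothetical labels)


def quality_score(checks: List[Check]) -> Optional[float]:
    """Return the score as a percentage, or None when nothing can be scored."""
    # Unready checks are skipped; zero-weight checks cannot affect the result.
    scored = [c for c in checks if c.status != "UNREADY" and c.weight > 0]
    total = sum(c.weight for c in scored)
    if total == 0:
        return None
    earned = sum(c.weight for c in scored if c.status == "PASSED")
    # Dividing by the remaining total renormalizes the weights to 100%.
    return 100 * earned / total


checks = [
    Check("Hypothesis", 30, "PASSED"),
    Check("Primary Metric", 30, "FAILED"),
    Check("Balanced Exposures", 40, "UNREADY"),  # experiment has not started
]
print(quality_score(checks))  # 50.0
```

With the unready check ignored, the remaining weights total 60, so one passing check worth 30 yields 50% rather than 30%.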

Viewing quality scores

When you enable quality scores, they appear in the details tab of an experiment. Each applicable check is evaluated and contributes to the displayed score.

Statsig color-codes the score based on the threshold it reaches:

- 85% or higher: passing (green)
- 50% to below 85%: warning (yellow)
- Below 50%: error (red)

Experiment quality score display with color-coded status
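When analyzing scores pulled in bulk, it can help to reproduce the color coding locally. The sketch below maps a percentage to the three statuses listed above; the function name and the returned labels are illustrative, not part of any Statsig API.

```python
def score_status(score_pct: float) -> str:
    """Map a quality score percentage to its display status."""
    if score_pct >= 85:
        return "passing/green"
    if score_pct >= 50:
        return "warning/yellow"
    return "error/red"


for pct in (92.0, 85.0, 65.5, 49.9):
    print(pct, score_status(pct))
```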

Quality scores are also available through the console API, which you can use for bulk data retrieval and analysis.
