Count Distinct Metrics

Count distinct metrics calculate the unique values observed in a column of a metric source. This is calculated per-unit, so the total is the number of unique unit-value pairs.

Use cases

Count distinct metrics calculate the number of unique values in a column of your metric source, counted per unit so the total is the number of unique unit-value pairs. They have two primary use cases:

Measuring interactions and surface area. For example, you might count the number of entities a user has engaged with on a video streaming platform, or measure if a new recommendation engine increases the diversity of products clicked.
As a denominator in ratio metrics, especially common when you want to normalize by a unit other than your experiment's unit of analysis. For example, a B2B experiment might run at the company level but measure "Clicks per USER" using a ratio metric of COUNT(clicks)/COUNT_DISTINCT(user_id).

Count-distinct metrics are more expensive to compute than count or unique unit count metrics, especially for very long experiments.

If you want to count distinct occurrences of the experiment's unit of assignment (e.g. the user_id in a user_id experiment), you should use a unit_count metrics instead. A unit_count metric achieves the same result but calculates and stores the metric data more efficiently.

In many cases a count metric serves as a close proxy to count-distinct; you can also set up a data source to track unique instances of a key to avoid re-running the distinct operation across experiment analyses.

Calculation

At the unit level, count distinct metrics use COUNT_DISTINCT on their input column.

At the group level, Statsig calculates the mean as the SUM of the unit-level COUNT_DISTINCTs, divided by the count of UNIQUE UNITS exposed to the experiment.

The SQL looks like the following:

sql

-- Unit Level
SELECT
  source_data.unit_id,
  exposure_data.group_id,
  COUNT(distinct source_data.value_column) as value
FROM source_data
JOIN exposure_data
ON
  -- Only include users who saw the experiment
  source_data.unit_id = exposure_data.unit_id
  -- Only include data from after the user saw the experiment
  -- In this case exposure_data is already deduped to the "first exposure"
  AND source_data.timestamp >= exposure_data.timestamp
GROUP BY
  source_data.unit_id,
  exposure_data.group_id;

-- Experiment
SELECT
  group_id,
  COUNT(distinct unit_id) total_units
FROM exposure_data
GROUP BY group_id;

-- Group Level
SELECT
  group_id,
  SUM(value)/SUM(total_units) as mean
FROM unit_data
JOIN group_data
USING (group_id)
GROUP BY group_id;

Methodology notes

In the metrics page view, Statsig uses APPROX_COUNT_DISTINCT (or equivalent) to avoid large compute jobs on analytical count distinct. The approximate error is acceptably small for topline estimates. For experiment result loads, the calculation is analytical and exact to avoid jitter or bias from approximation error.

Options

Metric Breakdowns
- You can configure Metadata Columns to group results by, getting easy access to dimensional views in pulse results.
Winsorization
- Specify a lower and/or upper percentile bound to winsorize at. Statsig clamps all values below the lower threshold, or above the upper threshold, to that threshold to reduce the outsized impact of outliers on your analysis.
CUPED
- Specify if you want to calculate CUPED, and the lookback window for CUPED's pre-experiment data inputs.
Thresholding
- Turn this metric into a 1/0 unit count metric counting if the unit's total count equals to or surpasses (>=) a given threshold.
Cohort Windows
- You can specify a window for data collection after a unit's exposure. For example, a 0-1 day cohort window counts only actions from days 0 and 1 after a unit is exposed to an experiment.
  - Select Only include units with a completed window to remove units from pulse analysis for this metric until the cohort window has completed.

Limits

Count distinct metrics work in most experiments, except for Switchbacks.

Was this helpful?