Count Distinct Metrics
Count distinct metrics calculate the unique values observed in a column of a metric source. This is calculated per-unit, so the total is the number of unique unit-value pairs.
If you want to count distinct occurrences of the experiment's unit of assignment (e.g. the user_id in a user_id experiment), you should use a unit_count metrics instead. This achieves the same result, but more efficiently calculates and stores the metric data.
In many cases a count metric serves as a close proxy to count-distinct; you can also set up a data source to track unique instances of a key to avoid re-running the distinct operation across experiment analyses.
Use cases
Count distinct metrics have two primary use cases:
- Measuring interactions and surface area. For example, you might count the number of entities a user has engaged with on a video streaming platform, or measure if a new recommendation engine increases the diversity of products clicked.
- As a denominator in ratio metrics, especially common when you want to normalize by a unit other than your experiment's unit of analysis. For example, a B2B experiment might run at the company level but measure "Clicks per USER" using a ratio metric of COUNT(clicks)/COUNT_DISTINCT(user_id).
Calculation
At the unit level, count distinct metrics use COUNT_DISTINCT on their input column.
At the group level, Statsig calculates the mean as the SUM of the unit-level COUNT_DISTINCTs, divided by the count of UNIQUE UNITS exposed to the experiment.
The SQL looks like the following:
-- Unit Level
SELECT
source_data.unit_id,
exposure_data.group_id,
COUNT(distinct source_data.value_column) as value
FROM source_data
JOIN exposure_data
ON
-- Only include users who saw the experiment
source_data.unit_id = exposure_data.unit_id
-- Only include data from after the user saw the experiment
-- In this case exposure_data is already deduped to the "first exposure"
AND source_data.timestamp >= exposure_data.timestamp
GROUP BY
source_data.unit_id,
exposure_data.group_id;
-- Experiment
SELECT
group_id,
COUNT(distinct unit_id) total_units
FROM exposure_data
GROUP BY group_id;
-- Group Level
SELECT
group_id,
SUM(value)/SUM(total_units) as mean
FROM unit_data
JOIN group_data
USING (group_id)
GROUP BY group_id;
Methodology notes
In the metrics page view, Statsig uses APPROX_COUNT_DISTINCT (or equivalent) to avoid large compute jobs on analytical count distinct. The approximate error is acceptably small for topline estimates. For experiment result loads, the calculation is analytical and exact to avoid jitter or bias from approximation error.
Options
- Metric Breakdowns
- You can configure Metadata Columns to group results by, getting easy access to dimensional views in pulse results
- Winsorization
- Specify a lower and/or upper percentile bound to winsorize at. Statsig clamps all values below the lower threshold, or above the upper threshold, to that threshold to reduce the outsized impact of outliers on your analysis.
- CUPED
- Specify if you want to calculate CUPED, and the lookback window for CUPED's pre-experiment data inputs
- Thresholding
- Turn this metric into a 1/0 unit count metric counting if the unit's total count equals to or surpasses (>=) a given threshold
- Cohort Windows
- You can specify a window for data collection after a unit's exposure. For example, a 0-1 day cohort window would only count actions from days 0 and 1 after a unit was exposed to an experiment
- Select Only include units with a completed window to remove units from pulse analysis for this metric until the cohort window has completed.
- You can specify a window for data collection after a unit's exposure. For example, a 0-1 day cohort window would only count actions from days 0 and 1 after a unit was exposed to an experiment
Limits
Count distinct metrics are available in most experiments, except for Switchbacks.
Was this helpful?