Setup Checklist

After you've connected your warehouse and set up both metrics and assignment sources, you can ensure your setup is complete and correct by checking the following items:

  1. Primary keys
  2. Timestamps
  3. Exposure duplication
  4. Data availability

Once you've completed these checks, your offline results should align with those in the Statsig console when advanced features are disabled.

1. Primary keys

When setting up an experiment, you select the unit of assignment, which acts as the primary key for joining assignments with metrics. The assignment source and the metrics source must use the same primary key.

In an Analyze-Only experiment, this primary key can be selected from the unit IDs defined by your assignment source.

  • Ensure the unit ID in your assignment source matches the unit ID in your metrics source; a quick way to check this is sketched at the end of this section.

In an Assign and Analyze experiment, the primary key (unit ID) is generated by the Statsig SDK.

  • You can verify this unit ID in the statsig_forwarded_exposures table within the assignment sources.
  • You must either forward the unit ID to the SDK (docs) or use the SDK to manage your features and generate the metrics table accordingly.
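
In either type of experiment, a quick sanity check is to measure how many exposed units have matching records in your metrics source. The query below is a minimal sketch; <exposures_table> and <metrics_table> are placeholders for your own assignment and metrics tables.

SELECT
  COUNT(distinct exposures.unit_id) as exposed_units,
  COUNT(distinct metrics.unit_id) as exposed_units_with_metrics
FROM <exposures_table> exposures
LEFT JOIN <metrics_table> metrics
  ON exposures.unit_id = metrics.unit_id;

A low match rate usually points to mismatched ID formats (for example, hashed versus raw IDs, or inconsistent casing).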

2. Timestamps

It is important to analyze metric data only after a user has been exposed to the experiment. Pre-experiment data should have no average treatment effect, and therefore its inclusion dilutes results. Statsig employs a timestamp-based join for this purpose, with an option for a date-based join for daily data. This should look like the SQL snippet below:

WITH
metrics as (...),
exposures as (...),
joined_data as (
  SELECT
    exposures.unit_id,
    exposures.experiment_id,
    exposures.group_id,
    metrics.timestamp,
    metrics.value
  FROM exposures
  JOIN metrics
    ON (
      exposures.unit_id = metrics.unit_id
      AND metrics.timestamp >= exposures.first_timestamp
    )
)
SELECT
  group_id,
  SUM(value) as value
FROM joined_data
GROUP BY group_id;
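
If your metric data is only available at daily granularity, the same join can be done on dates instead of timestamps. The sketch below assumes the metrics CTE exposes a metric_date column (a hypothetical name); adjust it to match your own schema.

WITH
metrics as (...),
exposures as (...),
joined_data as (
  SELECT
    exposures.unit_id,
    exposures.experiment_id,
    exposures.group_id,
    metrics.metric_date,
    metrics.value
  FROM exposures
  JOIN metrics
    ON (
      exposures.unit_id = metrics.unit_id
      AND metrics.metric_date >= CAST(exposures.first_timestamp AS DATE)
    )
)
SELECT
  group_id,
  SUM(value) as value
FROM joined_data
GROUP BY group_id;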

3. Exposure duplication

Exposure data must be de-duplicated before joining to ensure a single record per user. Many vendors also handle crossover users (users present in more than one experiment group) by removing them from the analysis and/or raising an alert when this happens with high frequency.

SELECT
  unit_id,
  experiment_id,
  MIN(timestamp) as first_timestamp,
  COUNT(distinct group_id) as groups
FROM <exposures_table>
GROUP BY
  unit_id,
  experiment_id
HAVING COUNT(distinct group_id) = 1;
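
To gauge how often crossover occurs, you can also compute the share of units exposed to more than one group. This is a minimal sketch reusing the <exposures_table> placeholder; what counts as "high frequency" is a judgment call, but a large share usually indicates an assignment or logging issue.

SELECT
  experiment_id,
  COUNT(*) as total_units,
  COUNT(CASE WHEN num_groups > 1 THEN 1 END) as crossover_units
FROM (
  SELECT
    unit_id,
    experiment_id,
    COUNT(distinct group_id) as num_groups
  FROM <exposures_table>
  GROUP BY
    unit_id,
    experiment_id
) per_unit
GROUP BY experiment_id;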

4. Data availability

When comparing a platform analysis to an experiment analysis run in the past, the underlying data may have since fallen out of retention or been deleted. To check this, compare the table's retention policy to the analysis dates used in your original experiment analysis and confirm the data still exists.
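
A simple way to confirm the data is still present is to check the date range available in each source table against your original analysis window. The query below is a sketch using a <metrics_table> placeholder; run the same check against your assignment source.

SELECT
  MIN(timestamp) as earliest_record,
  MAX(timestamp) as latest_record
FROM <metrics_table>;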

Make sure results match

After completing the above four steps, your offline analysis should produce results that match those in the Statsig console. Note that the Statsig console includes several advanced features, such as winsorization, CUPED, and the delta method for ratio metrics. We recommend disabling these features initially when comparing results.

This article provides an example of conducting offline calculations in Databricks.

If you have additional questions, just send us a Slack message. We are always here to help.