On this page

Setup Checklist

Onboarding checklist for Statsig Warehouse Native, covering warehouse connection, metric setup, assignment sources, and your first experiment launch.

After you've connected your warehouse and set up both metrics and assignment sources, verify your setup by checking the following items:

  1. Primary keys
  2. Timestamps
  3. Duplication
  4. Data availability

After completing these checks, your offline results should align with those in the Statsig console when advanced features are disabled.

1. Primary keys

When setting up an experiment, you can select the unit of assignment, acting as the primary key to join the assignment with metrics. The assignment source and the metrics source must use the same primary key.

In an Analyze-Only experiment, this primary key can be selected from the unit IDs defined by your assignment source.

  • Ensure the unit ID in your assignment source matches the unit ID in your metrics source.

In an Assign and Analyze experiment, the Statsig SDK generates the primary key (unit ID).

  • You can verify this unit ID in the statsig_forwarded_exposures table within the assignment sources.
  • You must either forward the unit ID to the SDK (docs) or use the SDK to manage your features and correspondingly generate the metrics table.

2. Timestamps

Analyze metric data only after a user has been exposed to the experiment. Pre-experiment data has no average treatment effect, so including it dilutes results. Statsig uses a timestamp-based join for this purpose, with an option for a date-based join for daily data. The join should look like the SQL snippet below:

sql
WITH 
metrics as (...),
exposures as (...),
joined_data as (
    SELECT 
        exposures.unit_id,
        exposures.experiment_id,
        exposures.group_id,
        metrics.timestamp,
        metrics.value
    FROM exposures
    JOIN metrics
    ON (
        exposures.unit_id = metrics.unit_id
        AND metrics.timestamp >= 
        	exposures.first_timestamp
    )
)
SELECT 
    group_id,
    SUM(value) as value
FROM joined_data
GROUP BY group_id;

3. Exposure duplication

Exposure data must be deduplicated before joining to ensure a single record per user. Many vendors also manage crossover users (users present in more than one experiment group) by removing them from analysis or alerting when this occurs with high frequency.

sql
SELECT 
    unit_id,
    experiment_id,
    MIN(timestamp) as first_timestamp,
    COUNT(distinct group_id) as groups
FROM <exposures_table>
GROUP BY 
    unit_id,
    experiment_id,
    group_id
HAVING COUNT(distinct group_id) = 1;

4. Data availability

When comparing a platform analysis to an existing experiment analysis that was run in the past, the underlying data may have fallen out of retention or been deleted. To check this, compare the table's retention policy to the analysis dates used in your original experiment analysis to confirm the data still exists.

Verify that results match

After completing the above four steps, your offline analysis should produce results that match those in the Statsig console.

The Statsig Console includes several advanced features, such as winsorization, CUPED, and the delta method to address ratio metrics. Disable these features initially when comparing results.
This article provides an example of conducting offline calculations in Databricks.

If you have additional questions, send Statsig a Slack message.

Was this helpful?