Setup Checklist

Onboarding checklist for Statsig Warehouse Native, covering warehouse connection, metric setup, assignment sources, and your first experiment launch.

This checklist verifies your Statsig Warehouse Native setup before you launch your first experiment. After you've connected your warehouse and set up both metrics and assignment sources, check the following items:

Primary keys
Timestamps
Exposure duplication
Data availability

After completing these checks, your offline results should align with those in the Statsig console when you disable advanced features.

1. Primary keys

When you set up an experiment, you select the unit of assignment, which acts as the primary key for joining assignments to metrics. The assignment source and the metrics source must use the same primary key.

In an Analyze-Only experiment, you can select this primary key from the unit IDs your assignment source defines.

Ensure the unit ID in your assignment source matches the unit ID in your metrics source.

In an Assign and Analyze experiment, the Statsig SDK generates the primary key (unit ID).

You can verify this unit ID in the statsig_forwarded_exposures table in your assignment sources.
You must either forward the unit ID to the SDK (see the docs) or use the SDK to manage your features and generate the metrics table accordingly.
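A quick way to sanity-check that both sources share a primary key is to measure how many assigned unit IDs also appear in the metrics source. The sketch below is a minimal illustration in plain Python; the function name `key_overlap` and the sample IDs are hypothetical, and in practice you would pull the ID columns from your own warehouse tables.

```python
def key_overlap(assignment_ids, metric_ids):
    """Return the share of assigned units that also appear in the metrics source."""
    assigned = set(assignment_ids)
    if not assigned:
        return 0.0
    matched = assigned & set(metric_ids)
    return len(matched) / len(assigned)


# Hypothetical sample data: one metrics source is keyed by the same user IDs
# as the assignment source, the other is keyed by device IDs.
assignment_units = ["u1", "u2", "u3", "u4"]
metrics_same_key = ["u1", "u2", "u2", "u4", "u9"]
metrics_other_key = ["d-100", "d-101", "d-102"]

print(key_overlap(assignment_units, metrics_same_key))   # 0.75
print(key_overlap(assignment_units, metrics_other_key))  # 0.0
```

An overlap near zero usually means the two sources are keyed by different ID types. A partial overlap is not necessarily a problem, since some exposed units may simply have no metric events.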

2. Timestamps

Analyze only the metric data recorded after a user's first exposure to the experiment. Pre-exposure data cannot reflect any treatment effect, so including it dilutes results. Statsig uses a timestamp-based join for this purpose, with an option for a date-based join for daily data. The join should look like the SQL snippet below:

sql

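-- Join each unit's metric events to its first exposure,
-- keeping only events recorded at or after that exposure.
WITH
metrics as (...),
exposures as (...),
joined_data as (
    SELECT
        exposures.unit_id,
        exposures.experiment_id,
        exposures.group_id,
        metrics.timestamp,
        metrics.value
    FROM exposures
    JOIN metrics
    ON (
        exposures.unit_id = metrics.unit_id
        AND metrics.timestamp >= exposures.first_timestamp
    )
)
-- Aggregate the post-exposure metric values per experiment group.
SELECT
    group_id,
    SUM(value) as value
FROM joined_data
GROUP BY group_id;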
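To see the effect of the timestamp condition, the following sketch runs the same kind of join on a few toy rows using Python's built-in `sqlite3` module. The table contents are invented for illustration, and the date-based variant is only one plausible reading of a daily join (comparing calendar dates rather than full timestamps), not a confirmed description of Statsig's behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE exposures (unit_id TEXT, experiment_id TEXT, group_id TEXT, first_timestamp TEXT);
CREATE TABLE metrics (unit_id TEXT, timestamp TEXT, value REAL);
INSERT INTO exposures VALUES
  ('u1', 'exp_a', 'control', '2024-01-02 12:00:00'),
  ('u2', 'exp_a', 'test',    '2024-01-03 09:00:00');
INSERT INTO metrics VALUES
  ('u1', '2024-01-01 08:00:00', 5.0),  -- before exposure
  ('u1', '2024-01-02 13:00:00', 2.0),  -- after exposure
  ('u2', '2024-01-03 08:59:59', 7.0),  -- same day, but before exposure
  ('u2', '2024-01-04 10:00:00', 3.0);  -- after exposure
""")

# Timestamp-based join: only events at or after the first exposure count.
totals = dict(conn.execute("""
SELECT e.group_id, SUM(m.value) AS value
FROM exposures AS e
JOIN metrics AS m
  ON e.unit_id = m.unit_id
 AND m.timestamp >= e.first_timestamp
GROUP BY e.group_id
""").fetchall())

# Assumed date-based variant: compare calendar dates only.
daily_totals = dict(conn.execute("""
SELECT e.group_id, SUM(m.value) AS value
FROM exposures AS e
JOIN metrics AS m
  ON e.unit_id = m.unit_id
 AND date(m.timestamp) >= date(e.first_timestamp)
GROUP BY e.group_id
""").fetchall())

print(totals)        # {'control': 2.0, 'test': 3.0}
print(daily_totals)  # {'control': 2.0, 'test': 10.0}
```

The two results differ only for the event that falls on the exposure day but before the exposure time, which is the kind of discrepancy to look for when an offline total does not match.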

3. Exposure duplication

Deduplicate exposure data before joining to ensure a single record per user. Many vendors also manage crossover users (users present in more than one experiment group) by removing them from analysis or alerting when this occurs with high frequency.

sql

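-- Group by unit and experiment only (not group_id), so that
-- COUNT(distinct group_id) can detect crossover users.
SELECT
    unit_id,
    experiment_id,
    MIN(group_id) as group_id,          -- the single group for this unit
    MIN(timestamp) as first_timestamp   -- first exposure
FROM <exposures_table>
GROUP BY
    unit_id,
    experiment_id
HAVING COUNT(distinct group_id) = 1;    -- drop crossover users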
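The sketch below illustrates deduplication with crossover removal on toy data, again using Python's built-in `sqlite3` module. It groups by unit and experiment only, because grouping by `group_id` as well would make `COUNT(DISTINCT group_id)` equal 1 for every row and hide crossover users. The table name `exposures_raw` and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE exposures_raw (unit_id TEXT, experiment_id TEXT, group_id TEXT, timestamp TEXT);
INSERT INTO exposures_raw VALUES
  ('u1', 'exp_a', 'control', '2024-01-02 12:00:00'),
  ('u1', 'exp_a', 'control', '2024-01-05 08:00:00'),  -- repeat exposure
  ('u2', 'exp_a', 'test',    '2024-01-03 09:00:00'),
  ('u3', 'exp_a', 'control', '2024-01-03 10:00:00'),
  ('u3', 'exp_a', 'test',    '2024-01-04 10:00:00');  -- crossover user
""")

# One row per unit and experiment, keeping the first exposure and
# excluding units seen in more than one group.
deduped = conn.execute("""
SELECT unit_id, experiment_id, MIN(group_id) AS group_id, MIN(timestamp) AS first_timestamp
FROM exposures_raw
GROUP BY unit_id, experiment_id
HAVING COUNT(DISTINCT group_id) = 1
ORDER BY unit_id
""").fetchall()

# Count crossover units so an unusually high rate can be flagged.
crossover_count = conn.execute("""
SELECT COUNT(*) FROM (
  SELECT unit_id
  FROM exposures_raw
  GROUP BY unit_id, experiment_id
  HAVING COUNT(DISTINCT group_id) > 1
)
""").fetchone()[0]

for row in deduped:
    print(row)
print(crossover_count)
```

Here `u1` collapses to its first exposure, `u2` passes through unchanged, and `u3` is dropped and counted as a crossover user.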

4. Data availability

When you compare a platform analysis against an experiment analysis that was run in the past, the underlying data may have aged out of retention or been deleted. Compare the table's retention policy with the analysis dates of the original experiment to confirm the data still exists.
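The retention check is simple date arithmetic. As a minimal sketch, assuming a retention policy expressed as a number of days (the function name and the example dates are hypothetical):

```python
from datetime import date, timedelta


def data_still_available(analysis_start, retention_days, today):
    """True if the earliest date the analysis needs is still inside the retention window."""
    oldest_retained = today - timedelta(days=retention_days)
    return analysis_start >= oldest_retained


# Hypothetical example: a table with 90-day retention, checked on 2024-06-30.
today = date(2024, 6, 30)
print(data_still_available(date(2024, 5, 1), 90, today))   # True
print(data_still_available(date(2024, 3, 15), 90, today))  # False
```

If the check fails, part of the original analysis window is no longer in the table, so a mismatch with the historical result is expected.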

Verify that results match

After completing these four checks (primary keys, timestamps, exposure duplication, and data availability), your offline analysis should produce results that match those in the Statsig console.

The Statsig console includes several advanced features, such as winsorization, CUPED, and the delta method for ratio metrics. Disable these features initially when comparing results.
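With the advanced features disabled, a plain difference in means is a reasonable baseline for the comparison. The sketch below computes a relative delta from per-unit metric values and checks it against a console figure within a tolerance. All names and numbers are placeholders, and the exact statistic the console reports for a given metric is an assumption here, so confirm the definition before relying on the comparison.

```python
import math


def relative_delta(control_values, test_values):
    """Relative lift of the test mean over the control mean (no winsorization or CUPED)."""
    control_mean = sum(control_values) / len(control_values)
    test_mean = sum(test_values) / len(test_values)
    return (test_mean - control_mean) / control_mean


def matches_console(offline_value, console_value, rel_tol=1e-6):
    """Compare an offline figure with the console figure, allowing for rounding."""
    return math.isclose(offline_value, console_value, rel_tol=rel_tol)


# Placeholder per-unit totals for each group.
control = [10.0, 12.0, 8.0, 10.0]  # mean 10.0
test = [11.0, 13.0, 9.0, 11.0]     # mean 11.0

offline = relative_delta(control, test)
console_reported = 0.10  # placeholder for the value read from the console

print(offline)
print(matches_console(offline, console_reported))
```

If the figures disagree, revisit the four checks above before re-enabling any advanced features.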

This article provides an example of conducting offline calculations in Databricks.

If you have additional questions, send Statsig a Slack message.
