Setup Checklist
Onboarding checklist for Statsig Warehouse Native, covering warehouse connection, metric setup, assignment sources, and your first experiment launch.
After you've connected your warehouse and set up both metrics and assignment sources, verify your setup by checking the following items:
- Primary keys
- Timestamps
- Duplication
- Data availability
After completing these checks, your offline results should align with those in the Statsig console when advanced features are disabled.
1. Primary keys
When setting up an experiment, you can select the unit of assignment, acting as the primary key to join the assignment with metrics. The assignment source and the metrics source must use the same primary key.
In an Analyze-Only experiment, this primary key can be selected from the unit IDs defined by your assignment source.
- Ensure the unit ID in your assignment source matches the unit ID in your metrics source.
In an Assign and Analyze experiment, the Statsig SDK generates the primary key (unit ID).
- You can verify this unit ID in the statsig_forwarded_exposures table within the assignment sources.
- You must either forward the unit ID to the SDK (docs) or use the SDK to manage your features and correspondingly generate the metrics table.
2. Timestamps
Analyze metric data only after a user has been exposed to the experiment. Pre-experiment data has no average treatment effect, so including it dilutes results. Statsig uses a timestamp-based join for this purpose, with an option for a date-based join for daily data. The join should look like the SQL snippet below:
WITH
metrics as (...),
exposures as (...),
joined_data as (
SELECT
exposures.unit_id,
exposures.experiment_id,
exposures.group_id,
metrics.timestamp,
metrics.value
FROM exposures
JOIN metrics
ON (
exposures.unit_id = metrics.unit_id
AND metrics.timestamp >=
exposures.first_timestamp
)
)
SELECT
group_id,
SUM(value) as value
FROM joined_data
GROUP BY group_id;
3. Exposure duplication
Exposure data must be deduplicated before joining to ensure a single record per user. Many vendors also manage crossover users (users present in more than one experiment group) by removing them from analysis or alerting when this occurs with high frequency.
SELECT
unit_id,
experiment_id,
MIN(timestamp) as first_timestamp,
COUNT(distinct group_id) as groups
FROM <exposures_table>
GROUP BY
unit_id,
experiment_id,
group_id
HAVING COUNT(distinct group_id) = 1;
4. Data availability
When comparing a platform analysis to an existing experiment analysis that was run in the past, the underlying data may have fallen out of retention or been deleted. To check this, compare the table's retention policy to the analysis dates used in your original experiment analysis to confirm the data still exists.
Verify that results match
After completing the above four steps, your offline analysis should produce results that match those in the Statsig console.
If you have additional questions, send Statsig a Slack message.
Was this helpful?