Warehouse Storage
Manage storage in Statsig Warehouse Native, including where intermediate datasets live in your warehouse and how to control object lifetimes and costs.
How warehouse storage works
Statsig uses its sandbox in your warehouse to cache intermediate tables and result tables. This enables incremental reloads (without recalculating metrics for every day of the experiment on each load) and allows you to use these tables for ad-hoc analysis.
Statsig stores tables in the sandbox schema or dataset you configured, so you can track storage footprint and manage permissions at that schema or dataset level.
Conventions and usage
Statsig shards tables by entity ID. For example, the experiment early_user_journey_acceleration has that identifier in its associated table names for scorecard loads. This is a reliable way to look up tables for a given experiment.
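As a rough illustration of that lookup, the sketch below filters a list of table names for an experiment identifier. It assumes you have already listed the tables in your sandbox schema with your warehouse's own tooling; the function name and the sample table names are hypothetical, not Statsig's actual naming.

```python
def tables_for_experiment(table_names, experiment_id):
    """Return the table names that contain the experiment identifier.

    Matching is a case-insensitive substring check, so an identifier that
    is a prefix of another experiment's identifier will also match that
    experiment's tables.
    """
    needle = experiment_id.lower()
    return sorted(name for name in table_names if needle in name.lower())


# Hypothetical table names, for illustration only.
sandbox_tables = [
    "results_early_user_journey_acceleration",
    "staging_early_user_journey_acceleration_day1",
    "results_checkout_redesign",
]

matches = tables_for_experiment(sandbox_tables, "early_user_journey_acceleration")
```

Here `matches` holds the two tables whose names include the experiment identifier.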
Statsig writes to special tables that appear in metric sources or assignment sources:
- pipeline overview: performance statistics for the jobs Statsig runs
- statsig_forwarded_events: events logged through statsig.log_event
- statsig_forwarded_exposures: exposures from experiments, gates, autotunes, and holdouts
- statsig_forwarded_switchback_exposures: switchback-formatted exposures
- statsig_daily_results: rendered results with statistics like p-value
Some of these tables have pre-set names; others you configure in data connection settings.
Many users ingest these tables as part of internal pipelines. Configure lookback windows so mutable data doesn't cause issues, because Statsig regularly updates data in these tables and in some cases backfills up to several days when data delays or repairs occur.
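One way to think about a lookback window is as a date range that your pipeline re-ingests on every run, replacing whatever it previously copied for those dates. The sketch below computes that range. The function name and the 3-day value are assumptions for illustration; pick a window that covers the longest backfill you expect.

```python
from datetime import date, timedelta


def reload_window(run_date, lookback_days):
    """Return the inclusive (start, end) date range to re-ingest on this run."""
    if lookback_days < 0:
        raise ValueError("lookback_days must be non-negative")
    return run_date - timedelta(days=lookback_days), run_date


# Assumed 3-day lookback: each run re-reads the run date and the 3 days before it.
start, end = reload_window(date(2024, 6, 10), lookback_days=3)
```

A downstream job would then delete and re-insert its own copy of the rows between `start` and `end`, so that late updates in the source tables are picked up.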
Exposures don't necessarily deduplicate. Fast-forwarded exposures duplicate records from daily exports, and Statsig retains only 30 days of history for warehouse native projects. After 30 days, Statsig treats a given unit's exposure as new and re-exports it.
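If a downstream consumer needs one exposure per unit, it can deduplicate on its own side. The sketch below keeps the earliest record for each unit and experiment pair. The field names `unit_id`, `experiment_id`, and `timestamp` are hypothetical; substitute the columns in your exported table.

```python
def first_exposures(rows):
    """Keep the earliest exposure for each (unit_id, experiment_id) pair."""
    earliest = {}
    for row in rows:
        key = (row["unit_id"], row["experiment_id"])
        current = earliest.get(key)
        if current is None or row["timestamp"] < current["timestamp"]:
            earliest[key] = row
    return list(earliest.values())


# Hypothetical rows: user u1 appears twice for the same experiment.
exposures = [
    {"unit_id": "u1", "experiment_id": "exp_a", "timestamp": 200},
    {"unit_id": "u1", "experiment_id": "exp_a", "timestamp": 100},
    {"unit_id": "u2", "experiment_id": "exp_a", "timestamp": 150},
]

deduped = first_exposures(exposures)
```

Because the re-export after 30 days carries a later timestamp than the original, keeping the earliest record preserves the first exposure as long as your own copy retains it.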
Volume
Scorecard loads generate a varying number of tables depending on the number of metric sources accessed and the types of metrics loaded. Statsig may also materialize intermediate tables before or after large operations, which reduces compute cost.
This can produce a large number of artifacts. Customers running 300+ experiments have encountered default quota limits on vendors like Databricks. You can address this by requesting a quota increase or configuring the TTLs described in the Management section.
Management
Transient tables have a short TTL, usually 1-2 days, and Statsig automatically cleans them up.
Other tables are permanent by default. You can clean them up from the experiment in Statsig's console or as part of launching an experiment. You can also configure TTLs per table type in the data connection section of a project's settings.
Plan to manage storage using your own warehouse tools in addition to Statsig's systems, for example by cleaning up entities that haven't been accessed or modified in the last month. Ideally this isn't necessary given TTLs, but there are known cases where Statsig's internal tracking can consider a table dropped when it still has a storage footprint. Statsig can't guarantee that all tables will be removed.
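A minimal sketch of that kind of warehouse-side check follows. It assumes you can obtain each table's last-modified and last-accessed timestamps from your warehouse's metadata; the dictionary keys, the function name, and the 30-day threshold are illustrative assumptions. It only selects candidates and leaves the actual drop to your own tooling.

```python
from datetime import datetime, timedelta


def stale_tables(tables, now, max_idle_days=30):
    """Return names of tables neither accessed nor modified within max_idle_days."""
    cutoff = now - timedelta(days=max_idle_days)
    stale = []
    for table in tables:
        # A table counts as active if either timestamp is recent.
        last_touched = max(table["last_modified"], table["last_accessed"])
        if last_touched < cutoff:
            stale.append(table["name"])
    return stale


# Hypothetical metadata rows, for illustration only.
metadata = [
    {"name": "staging_a", "last_modified": datetime(2024, 5, 1), "last_accessed": datetime(2024, 5, 15)},
    {"name": "staging_b", "last_modified": datetime(2024, 5, 1), "last_accessed": datetime(2024, 6, 20)},
]

candidates = stale_tables(metadata, now=datetime(2024, 7, 1))
```

With a 30-day threshold and a run time of 2024-07-01, only `staging_a` is a cleanup candidate, because `staging_b` was accessed on 2024-06-20.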
How TTLs work
When Statsig creates or modifies a managed table, it schedules a cleanup at the current time plus the TTL. For example, if a Result table is written on 2024-06-01 and Result tables are configured with a 14-day TTL, a deletion is scheduled for 2024-06-15.
If that table is modified on 2024-06-07 (for example, through a scorecard reload), the deletion request is reset to 2024-06-21, overwriting the existing one. This means incremental updates on long-running experiments keep their staging data until the experiment stops.
Changing TTLs doesn't retroactively affect existing tables' deletion requests. The new TTL applies at the next scorecard load for the relevant experiment.
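The scheduling behavior described above can be modeled with a small sketch. It is an illustration of the rules as stated, not Statsig's implementation; the class and method names are hypothetical.

```python
from datetime import date, timedelta


class DeletionSchedule:
    """Toy model: each write replaces the table's pending deletion date."""

    def __init__(self):
        self._due = {}

    def record_write(self, table, written_on, ttl_days):
        # The TTL in effect at write time determines the new deletion date.
        self._due[table] = written_on + timedelta(days=ttl_days)

    def due_date(self, table):
        return self._due[table]


schedule = DeletionSchedule()

# Result table written on 2024-06-01 with a 14-day TTL.
schedule.record_write("result_table", date(2024, 6, 1), ttl_days=14)
first_due = schedule.due_date("result_table")

# A scorecard reload on 2024-06-07 overwrites the pending deletion.
schedule.record_write("result_table", date(2024, 6, 7), ttl_days=14)
second_due = schedule.due_date("result_table")
```

In this model `first_due` is 2024-06-15 and `second_due` is 2024-06-21, matching the example dates. Lowering the configured TTL changes nothing until `record_write` is called again, which mirrors the rule that a new TTL applies only at the next scorecard load.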
Types of tables for TTL
- Result Datasets: the final tables Statsig creates at the end of an experiment or gate reload, containing aggregated group-metric level data. These are generally small (1 row per metric/day/group/dimension) and useful for post-hoc analysis.
- Intermediate Tables: all other tables Statsig writes to during an experiment reload. These can be large because they contain user-level data. Statsig reuses them for incremental and metric reloads.
- Transient Datasets: tables created for one-off queries (most commonly Explore queries and Power Analyses), or temporary datasets used while creating Intermediate Tables as a performance optimization. By default, these are dropped after 2-3 days unless overridden with the setting above.
Explore query dependencies: Explore queries rely on permanent staging tables. These tables reduce the need to recompute data for analysis already performed by the scorecard run. Unlike results tables (which are cached locally on Statsig servers), permanent staging tables must be maintained in your warehouse for Explore queries to function. This avoids reprocessing large volumes of data that may contain PII or other sensitive information.
Troubleshooting storage issues
Missing data errors
Warehouse Native users may encounter TABLE_OR_VIEW_NOT_FOUND errors when required data tables are missing from the warehouse. This typically occurs when:
- Permanent staging tables have been dropped: Explore queries and advanced analysis require permanent staging tables, not results or transient staging tables.
- Tables have expired under TTL settings: Statsig automatically cleans up tables once their configured time-to-live (TTL) elapses.
- Incomplete data loads: Initial experiment setup or data pipeline issues may prevent table creation.
Resolution steps
For missing permanent staging tables: run a full reload to recreate the staging dataset.
For general missing tables:
- Check your warehouse's TTL settings in the data connection configuration.
- Verify that permanent staging tables exist in your configured sandbox schema.
- If tables were manually dropped, trigger a full data reload.
- Contact support if tables are still missing after reload, or if you didn't drop them.
Storage dependencies
Warehouse Native uses several types of tables with different storage patterns:
- Permanent staging tables: required for Explore queries and advanced analysis.
- Transient staging tables: short-lived intermediate tables. Most are cleaned up automatically (1-2 day TTL), but some small tables that are useful for ad-hoc analysis, such as regression coefficients, are stored permanently.
- Results tables: output statistics from the pipeline, copied and cached locally on Statsig servers.
Vacuum jobs don't affect staging tables used by Statsig.