Warehouse Storage

Manage storage in Statsig Warehouse Native, including where intermediate datasets live in your warehouse and how to control object lifetimes and costs.

How warehouse storage works

Statsig uses its sandbox in your warehouse to cache intermediate tables and result tables. Caching enables incremental reloads (without recalculating metrics for every day of the experiment on each load) and allows you to use these tables for ad-hoc analysis.

Statsig stores tables in the sandbox schema or dataset you configured. You can use this location to track storage footprint and manage permissions.

Conventions and usage

Statsig shards tables by entity ID. For example, the experiment early_user_journey_acceleration has that identifier in its associated table names for scorecard loads. This naming convention is a reliable way to look up tables for a given experiment.
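As a minimal sketch of that lookup, the snippet below filters a list of table names by experiment identifier. The function name and the table names are hypothetical; in practice you would fetch the list from your warehouse's catalog or information schema, and the real table names in your sandbox will differ.

```python
def tables_for_experiment(table_names, experiment_id):
    """Return the table names that contain the experiment identifier."""
    needle = experiment_id.lower()
    return sorted(name for name in table_names if needle in name.lower())


# Hypothetical table names, used only for illustration.
sandbox_tables = [
    "early_user_journey_acceleration_results",
    "early_user_journey_acceleration_staging",
    "checkout_redesign_results",
]

matches = tables_for_experiment(sandbox_tables, "early_user_journey_acceleration")
for name in matches:
    print(name)
```

Substring matching also returns tables for any experiment whose identifier contains the one you searched for, so review the matches before acting on them.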

Statsig writes to special tables that appear in metric sources or assignment sources:

pipeline overview: performance statistics for the jobs Statsig runs
statsig_forwarded_events: events logged through statsig.log_event
statsig_forwarded_exposures: exposures from experiments, gates, autotunes, and holdouts
statsig_forwarded_switchback_exposures: switchback-formatted exposures
statsig_daily_results: rendered results with statistics like p-value

Some of these tables have pre-set names; others you configure in data connection settings.

Many users ingest these tables as part of internal pipelines. Statsig regularly updates data in these tables, and in some cases backfills up to several days when data delays or repairs occur. To prevent mutable data from causing issues, configure lookback windows.
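One way to think about a lookback window in a downstream pipeline is as a date range that each run re-reads and overwrites, rather than appending only the newest day. The sketch below assumes a hypothetical helper name and a 3-day lookback; choose a window that covers the backfills you actually observe.

```python
from datetime import date, timedelta


def reload_window(run_date, lookback_days):
    """Return the inclusive (start, end) dates to re-ingest on a given run."""
    if lookback_days < 0:
        raise ValueError("lookback_days must be non-negative")
    return run_date - timedelta(days=lookback_days), run_date


# Assumed values for illustration: a run on 2024-06-10 with a 3-day lookback.
start, end = reload_window(date(2024, 6, 10), lookback_days=3)
print(start, end)
```

A pipeline using this range would replace its own copy of every day from start to end on each run, so late updates inside the window are picked up.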

Exposures aren't necessarily deduplicated. Fast-forwarded exposures duplicate records from daily exports, and Statsig retains only 30 days of history for Warehouse Native projects. After 30 days, Statsig treats a given unit's exposure as new and re-exports it.
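If your pipeline needs one exposure per unit, you can deduplicate on your side. The following is a sketch only: the field names unit_id, experiment_id, and timestamp are assumptions for illustration and may not match the columns in your exposure tables.

```python
def first_exposures(rows):
    """Keep the earliest exposure per (unit_id, experiment_id) pair."""
    earliest = {}
    for row in rows:
        key = (row["unit_id"], row["experiment_id"])
        if key not in earliest or row["timestamp"] < earliest[key]["timestamp"]:
            earliest[key] = row
    return list(earliest.values())


# Hypothetical rows; ISO-8601 strings in one format compare correctly as text.
exposures = [
    {"unit_id": "u1", "experiment_id": "exp_a", "timestamp": "2024-06-02T10:00:00"},
    {"unit_id": "u1", "experiment_id": "exp_a", "timestamp": "2024-06-01T09:00:00"},
    {"unit_id": "u2", "experiment_id": "exp_a", "timestamp": "2024-06-03T12:00:00"},
]

deduped = first_exposures(exposures)
```

Keeping the earliest record means a re-export after the 30-day window does not move a unit's first exposure date in your copy of the data.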

Volume

Scorecard loads generate a varying number of tables depending on the number of metric sources accessed and the types of metrics loaded. Statsig may also materialize intermediate tables before or after large operations, which reduces compute cost.

This process can produce a large number of artifacts. Customers running 300+ experiments have encountered default quota limits on vendors like Databricks. You can address quota limits by requesting a quota increase or configuring the TTLs described in the Management section.

Management

Transient tables have a short TTL, usually 1-2 days, and Statsig automatically cleans them up.

Other tables are permanent by default. You can clean them up from the experiment in Statsig's console or as part of launching an experiment. You can also configure TTLs per table type in the data connection section of a project's settings.

Plan to manage storage using your own warehouse tools in addition to Statsig's systems. For example, clean up entities that haven't been accessed or modified in the last month. Ideally, manual cleanup isn't necessary given TTLs, but there are known cases where Statsig's internal tracking can consider a table dropped when it still has a storage footprint. Statsig can't guarantee that it removes all tables.
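A cleanup pass of that kind could start from a list of candidates like the one sketched below. The metadata shape (name and last_modified) and the 30-day threshold are assumptions; how you obtain last-modified or last-accessed timestamps depends on your warehouse. The sketch only selects candidates and drops nothing.

```python
from datetime import datetime, timedelta


def stale_tables(tables, now, max_age_days=30):
    """Return names of tables whose last modification is older than the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    return [t["name"] for t in tables if t["last_modified"] < cutoff]


# Hypothetical metadata, as you might assemble it from your warehouse catalog.
sandbox_metadata = [
    {"name": "old_experiment_staging", "last_modified": datetime(2024, 5, 15)},
    {"name": "active_experiment_staging", "last_modified": datetime(2024, 6, 20)},
]

candidates = stale_tables(sandbox_metadata, now=datetime(2024, 7, 1))
print(candidates)
```

Review the candidates before dropping anything, because permanent staging tables for experiments you still analyze may be among them.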

How TTLs work

When Statsig creates or modifies a managed table, it schedules a cleanup at the current time plus the TTL. For example, if Statsig writes a Result table on 2024-06-01 and you configure Result tables with a 14-day TTL, Statsig schedules a deletion for 2024-06-15.

If Statsig modifies that table on 2024-06-07 (for example, through a scorecard reload), it resets the deletion request to 2024-06-21, overwriting the existing one. This behavior means incremental updates on long-running experiments keep their staging data until the experiment stops.

A TTL change doesn't retroactively affect existing tables' deletion requests. The new TTL applies at the next scorecard load for the relevant experiment.
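The date arithmetic above can be sketched as follows. This models the described behavior only and is not Statsig's implementation; the function name is hypothetical, and the TTL change to 7 days and the 2024-06-12 load are invented for illustration.

```python
from datetime import date, timedelta


def scheduled_deletion(last_modified, ttl_days):
    """Deletion date is the last modification date plus the TTL in effect."""
    return last_modified + timedelta(days=ttl_days)


# Result table written on 2024-06-01 with a 14-day TTL.
pending = scheduled_deletion(date(2024, 6, 1), ttl_days=14)
initial = pending

# A scorecard reload on 2024-06-07 overwrites the pending deletion.
pending = scheduled_deletion(date(2024, 6, 7), ttl_days=14)
after_reload = pending

# Hypothetical: the TTL is changed to 7 days afterwards. The pending
# deletion stays as it is until the next load, here assumed on 2024-06-12.
pending = scheduled_deletion(date(2024, 6, 12), ttl_days=7)
after_ttl_change = pending

print(initial, after_reload, after_ttl_change)
```

In this model a shorter TTL takes effect only once the table is next written, which matches the non-retroactive behavior described above.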

Types of tables for TTL

Result Datasets: the final tables Statsig creates at the end of an experiment or gate reload, containing aggregated group-metric level data. These are generally small (1 row per metric/day/group/dimension) and useful for post-hoc analysis.
Intermediate Tables: all other tables Statsig writes to during an experiment reload. These can be large because they contain user-level data. Statsig reuses them for incremental and metric reloads.
Transient Datasets: tables created for one-off queries (most commonly Explore queries and Power Analyses), or temporary datasets used while creating Intermediate Tables as a performance optimization. By default, Statsig drops these after 2-3 days unless you override them with a configured TTL.

Explore query dependencies: Explore queries rely on permanent staging tables. These tables reduce the need to recompute data for analysis the scorecard run already performed. Unlike results tables (which Statsig caches locally on its servers), you must maintain permanent staging tables in your warehouse for Explore queries to function. Maintaining these tables avoids reprocessing large volumes of data that may contain PII or other sensitive information.

Troubleshooting storage issues

Missing data errors

Warehouse Native users may encounter TABLE_OR_VIEW_NOT_FOUND errors when required data tables are missing from the warehouse. These errors typically occur when:

Permanent staging tables have been dropped: Explore queries and advanced analysis require permanent staging tables, not results or transient staging tables.
Tables have expired under TTL settings: Statsig automatically cleans up tables once their configured time-to-live (TTL) elapses.
Incomplete data loads: Initial experiment setup or data pipeline issues may prevent table creation.

Resolution steps

For missing staging tables: run a full reload to recreate the permanent staging dataset.

For general missing tables:

Check the TTL settings in your project's data connection configuration.
Verify that permanent staging tables exist in your configured sandbox schema.
If you manually dropped tables, trigger a full data reload.
Contact support if tables are still missing after reload, or if you didn't drop them.

Storage dependencies

Warehouse Native uses several types of tables with different storage patterns:

Permanent staging tables: required for Explore queries and advanced analysis.
Transient staging tables: short-lived intermediate tables. Statsig cleans up most of them automatically (1-2 day TTL), but some small tables that are useful for ad-hoc analysis, such as regression coefficients, are stored permanently.
Results tables: output statistics from the pipeline, copied and cached locally on Statsig servers.

Vacuum jobs don't affect staging tables used by Statsig.
