Reconciling Results Between Experimentation Platforms

Learn how to reconcile differences in experiment results between different analysis platforms.

Why results differ between platforms

Experiment results can differ between analysis platforms because the same data supports many analysis methodologies. Most differences trace back to how each platform joins exposure data to metric data, which statistical features it applies, or how it defines a metric. Check those causes one at a time, in that order, to find and resolve the gaps.

General approach

When companies evaluate an experimentation vendor, differences in results between their in-house platform and the vendor's platform are common during Proof-of-Concept (POC) validations. You can typically resolve these gaps by working through the following hypotheses in order:

1. The two platforms read or join the metric source data to exposure data differently, invalidating downstream steps.
2. Advanced statistical features available on the vendor side but not in-house are working as intended, most often reducing the influence of outliers or pre-experiment bias.
3. There is a misunderstanding of how a metric definition works, or how an advanced configuration on a metric or experiment behaves.

By working through these in order, data teams can quickly understand and address gaps, or decide whether the vendor's approach is acceptable.

Joining data

Based on observational data, differences in experiment results most often stem from how the platform joins exposure data with metric data. At the end of this section, there is a basic check for confirming that a join issue isn't occurring.

ID formats

In some cases, systems log IDs in different formats to different places. For example, the binary ID 4TLCtqzctSqusYcQljJLJE maps to the UUID a0fb4ef0-9d9e-11eb-9462-7bfc2b9a6ff2, so a company might have the binary ID in their production environment while their data users work with the equivalent UUIDs.

Exposures logged with the binary ID can't join to metric data keyed by the UUID, so the results come back empty. Check samples from both the metric source and the assignment source or diagnostic logstream to confirm that the identifiers are in the same format.
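
The sample check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of Statsig's tooling: the function names and the two sample lists are hypothetical, and it only distinguishes canonical UUIDs from everything else.

```python
import re
from collections import Counter

# Canonical 8-4-4-4-12 hexadecimal UUID layout.
UUID_PATTERN = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)

def id_format(value: str) -> str:
    """Label an identifier as 'uuid' or 'other' based on its layout."""
    return "uuid" if UUID_PATTERN.match(value) else "other"

def format_counts(ids):
    """Count how many sampled identifiers fall into each format."""
    return Counter(id_format(v) for v in ids)

# Hypothetical samples pulled from each source.
exposure_ids = ["4TLCtqzctSqusYcQljJLJE"]
metric_ids = ["a0fb4ef0-9d9e-11eb-9462-7bfc2b9a6ff2"]

exposure_formats = format_counts(exposure_ids)
metric_formats = format_counts(metric_ids)

# If the two sources disagree on format, the join will silently drop rows.
formats_match = set(exposure_formats) == set(metric_formats)
print(exposure_formats, metric_formats, formats_match)
```

A mismatch here points to an ID format problem before any join logic needs inspecting.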

You can use ID Resolution to bridge ID type gaps, but Statsig didn't design it for this scenario. ID Resolution helps connect identifiers across logged-out and logged-in sessions, or in other scenarios where a user switches identifiers during the experiment.

Timestamps

Analyze only metric data recorded after a user is exposed to the experiment. Pre-experiment data carries no treatment effect, so including it dilutes results.

Statsig Cloud

Statsig Cloud uses a date-based join between exposures and metric data. Experiments include metric data from the whole of the first exposure date for each experimental unit. This means some metric data from earlier on that date, before the unit was exposed, can be included, but that data should carry no treatment effect. The SQL snippet below illustrates the join:

WITH
metrics as (...),
exposures as (...),
joined_data as (
    SELECT
        exposures.unit_id,
        exposures.experiment_id,
        exposures.group_id,
        metrics.timestamp,
        metrics.value
    FROM exposures
    JOIN metrics
    ON (
        exposures.unit_id = metrics.unit_id
        AND metrics.date_id >= exposures.first_date_id
    )
)
SELECT
    group_id,
    SUM(value) as value
FROM joined_data
GROUP BY group_id;

Statsig's exposures are always in UTC. If metric data is in another timezone, convert it to UTC so the join doesn't compare dates from different timezones.
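
As a rough illustration of why the timezone matters for a date-based join, the sketch below converts a local metric timestamp to a UTC date. The fixed UTC-8 offset and the function name are assumptions made for the example; substitute whatever timezone the metric source actually uses.

```python
from datetime import datetime, timedelta, timezone

# Assumption: metric timestamps were logged in a fixed UTC-8 local time.
LOCAL_TZ = timezone(timedelta(hours=-8))

def to_utc_date_id(local_timestamp: str) -> str:
    """Convert a naive local ISO timestamp to its UTC calendar date."""
    local_dt = datetime.fromisoformat(local_timestamp).replace(tzinfo=LOCAL_TZ)
    return local_dt.astimezone(timezone.utc).date().isoformat()

# 9 PM on March 1 in UTC-8 is already March 2 in UTC.
print(to_utc_date_id("2024-03-01T21:00:00"))
```

An event late in the local day lands on the next UTC date, so an unconverted date column can place metric data on the wrong side of the first exposure date.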

Statsig supports timestamp-based joins for some Enterprise Cloud customers. Contact Statsig to learn more.

Statsig Warehouse Native

Statsig WHN uses a timestamp-based join, with an option for a date-based join for daily data. The SQL snippet below illustrates this:

WITH
metrics as (...),
exposures as (...),
joined_data as (
    SELECT
        exposures.unit_id,
        exposures.experiment_id,
        exposures.group_id,
        metrics.timestamp,
        metrics.value
    FROM exposures
    JOIN metrics
    ON (
        exposures.unit_id = metrics.unit_id
        AND metrics.timestamp >= exposures.first_timestamp
    )
)
SELECT
    group_id,
    SUM(value) as value
FROM joined_data
GROUP BY group_id;

Timestamps for Statsig's exposures are always in UTC. If metric data is in another timezone, convert it to UTC so the join doesn't compare timestamps from different timezones.
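
To see how the date-based and timestamp-based joins in this section can produce different totals from the same data, consider the toy sketch below. The rows, names, and values are all hypothetical; it mirrors the join conditions in the SQL snippets rather than any actual Statsig pipeline code.

```python
from datetime import datetime

# Hypothetical data: one unit first exposed at noon UTC on March 1.
first_exposure = {"u1": datetime(2024, 3, 1, 12, 0)}
metric_events = [
    ("u1", datetime(2024, 2, 29, 15, 0), 3.0),  # previous day
    ("u1", datetime(2024, 3, 1, 9, 0), 5.0),    # same date, before exposure
    ("u1", datetime(2024, 3, 1, 15, 0), 7.0),   # after exposure
]

def total(events, exposures, date_based: bool) -> float:
    """Sum metric values that pass the date-based or timestamp-based join."""
    result = 0.0
    for unit_id, ts, value in events:
        exposed_at = exposures.get(unit_id)
        if exposed_at is None:
            continue  # unit never exposed, so an inner join drops it
        if date_based:
            keep = ts.date() >= exposed_at.date()
        else:
            keep = ts >= exposed_at
        if keep:
            result += value
    return result

date_total = total(metric_events, first_exposure, date_based=True)
timestamp_total = total(metric_events, first_exposure, date_based=False)
print(date_total, timestamp_total)
```

The date-based total includes the 9 AM event from the exposure date while the timestamp-based total does not, which is the kind of gap to expect when two platforms use different join granularities.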

Exposure duplication

De-duplicate exposure data before joining to ensure a single record per user. Many vendors also manage crossover users (users present in more than one experiment group), removing them from analysis or alerting when crossovers occur at high frequency.

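SELECT
    unit_id,
    experiment_id,
    -- Safe because HAVING keeps only units seen in exactly one group
    MIN(group_id) as group_id,
    MIN(timestamp) as first_timestamp
FROM <exposures_table>
GROUP BY
    unit_id,
    experiment_id
-- Drop crossover units that appear in more than one group
HAVING COUNT(distinct group_id) = 1;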
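
The same de-duplication logic can be sketched in Python for checking a small sample by hand. The function name and row layout are hypothetical, and dropping crossover units outright is only one of the policies mentioned above.

```python
from collections import defaultdict

def first_exposures(rows):
    """Keep one record per (unit, experiment), dropping crossover units.

    rows: iterable of (unit_id, experiment_id, group_id, timestamp) tuples.
    Returns {(unit_id, experiment_id): (group_id, first_timestamp)}.
    """
    groups = defaultdict(set)
    first_seen = {}
    for unit_id, experiment_id, group_id, ts in rows:
        key = (unit_id, experiment_id)
        groups[key].add(group_id)
        if key not in first_seen or ts < first_seen[key]:
            first_seen[key] = ts
    return {
        key: (next(iter(seen)), first_seen[key])
        for key, seen in groups.items()
        if len(seen) == 1  # units seen in several groups are crossovers
    }

rows = [
    ("u1", "exp", "control", 10),
    ("u1", "exp", "control", 4),   # duplicate exposure, earlier timestamp
    ("u2", "exp", "control", 5),
    ("u2", "exp", "test", 6),      # crossover: same unit, second group
]
deduped = first_exposures(rows)
print(deduped)
```

Here `u1` collapses to a single record with its earliest timestamp, and `u2` is excluded because it appears in two groups.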

Data availability

When comparing a platform analysis to an existing experiment analysis that ran in the past, the underlying data may have fallen out of retention or been deleted. Compare the table's retention policy to the analysis dates used in your original experiment analysis to confirm that the data still exists. Also confirm that you configured your experiment in the vendor console to analyze the same time range as your original analysis.
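
A quick way to run the retention comparison is sketched below. The function name and the example dates are hypothetical; the retention period would come from your warehouse's table settings.

```python
from datetime import date, timedelta

def data_still_retained(analysis_start: date, retention_days: int, today: date) -> bool:
    """Return True if the analysis start date falls inside the retention window."""
    oldest_retained = today - timedelta(days=retention_days)
    return analysis_start >= oldest_retained

# Hypothetical check: an analysis that started on January 1, evaluated on March 1.
print(data_still_retained(date(2024, 1, 1), 90, date(2024, 3, 1)))
print(data_still_retained(date(2024, 1, 1), 30, date(2024, 3, 1)))
```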

Validation

To validate the initial metric data and join, use the query provided in the Timestamps section, modifying it to run on both platforms. Confirm that a target metric has the same totals per group across both platforms. Warehouse Native platforms have an advantage here because the SQL dialect and source data are generally the same in both vendor code and in-house code, making comparisons simpler. Pick one metric of interest, validate that data, and resolve any differences before checking statistical and metric methodologies.
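
Once both platforms have produced per-group totals, the comparison itself can be sketched as follows. The function name, group names, and totals are hypothetical, and the tolerance is an assumption you may want to loosen for floating-point metrics.

```python
import math

def compare_group_totals(in_house, vendor, rel_tol=1e-9):
    """Return the groups whose metric totals differ between two platforms.

    in_house, vendor: {group_id: total} dictionaries from each platform.
    A group missing on one side is reported with None for that side.
    """
    mismatches = {}
    for group_id in sorted(set(in_house) | set(vendor)):
        a = in_house.get(group_id)
        b = vendor.get(group_id)
        if a is None or b is None or not math.isclose(a, b, rel_tol=rel_tol):
            mismatches[group_id] = (a, b)
    return mismatches

# Hypothetical totals for one metric from each platform.
in_house_totals = {"control": 1200.0, "test": 1260.0}
vendor_totals = {"control": 1200.0, "test": 1185.5}
mismatches = compare_group_totals(in_house_totals, vendor_totals)
print(mismatches)
```

An empty result means the join and source data agree for that metric, so any remaining gap is more likely to come from statistical features or metric definitions.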

Statistical features

Choices in statistical methodologies can significantly impact experiment results. The following are common root causes for gaps in results. Always closely read the queries the vendor runs to understand any particulars in methodology.

Winsorization

Winsorization, which caps extreme values at a percentile threshold rather than removing them, can dramatically alter experiment outcomes. Disable this feature in Statsig metrics when doing cross-system comparisons unless you're also applying it manually.
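
If you do apply it manually, a minimal sketch of percentile-based Winsorization looks like the following. The nearest-rank percentile method and the thresholds are assumptions for illustration; they are not necessarily the ones Statsig uses, so match them to the metric's configuration before comparing.

```python
def percentile(sorted_values, pct: int):
    """Nearest-rank percentile (pct in 0..100) of a sorted, non-empty list."""
    # Integer ceiling of pct * n / 100, floored at rank 1.
    rank = max(1, (pct * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]

def winsorize(values, lower_pct: int = 1, upper_pct: int = 99):
    """Clamp values outside the given percentiles to the percentile bounds."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    return [min(max(v, low), high) for v in values]

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
capped = winsorize(values, lower_pct=0, upper_pct=90)

raw_mean = sum(values) / len(values)
capped_mean = sum(capped) / len(capped)
print(capped, raw_mean, capped_mean)
```

A single extreme unit moves the raw mean far more than the capped mean, which is why a platform that Winsorizes and one that doesn't can report very different deltas.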

CUPED

CUPED can significantly change variances and observed deltas, especially when pre- and post-exposure data are highly correlated or when groups differ systematically in their pre-experiment data. You can configure CUPED at the metric level. You can also disable it for a Pulse result set after running analysis.
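
For intuition about why CUPED changes variance, here is a minimal sketch of the textbook adjustment, where each unit's value is shifted by `theta * (x - mean(x))` and `theta = cov(x, y) / var(x)`. This is an assumption-laden illustration: the function name is hypothetical, it uses a single pre-experiment covariate, and it is not a description of Statsig's exact implementation.

```python
def cuped_adjust(post, pre):
    """Return CUPED-adjusted values: y - theta * (x - mean(x)).

    post: in-experiment metric values per unit (y).
    pre: pre-experiment covariate values for the same units (x).
    """
    n = len(post)
    mean_pre = sum(pre) / n
    mean_post = sum(post) / n
    cov = sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post)) / (n - 1)
    var_pre = sum((x - mean_pre) ** 2 for x in pre) / (n - 1)
    # With a constant covariate there is nothing to adjust for.
    theta = cov / var_pre if var_pre > 0 else 0.0
    return [y - theta * (x - mean_pre) for x, y in zip(pre, post)]

# Toy data where post-exposure values are perfectly predicted by pre-exposure values.
pre = [1.0, 2.0, 3.0, 4.0]
post = [2.0, 4.0, 6.0, 8.0]
adjusted = cuped_adjust(post, pre)
print(adjusted)
```

In this toy case the adjusted values keep the original mean but lose all of their variance. Real data is less extreme, but the stronger the pre/post correlation, the larger the gap between CUPED and non-CUPED results.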

Ratio metrics

For ratio metrics using the delta method, Statsig includes only units with a non-zero denominator. Statsig calculates ratios and means as

$\bar{u} = \frac{\sum_{i=1}^{n} \text{numerator}_i}{\sum_{i=1}^{n} \text{denominator}_i}$

and uses the delta method to correct for the cluster-based nature of these metrics.
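
To reproduce a ratio metric in-house, the sketch below computes the ratio and a first-order delta method variance from per-unit totals, after dropping units with a zero denominator. The function name is hypothetical and the variance formula is the standard textbook form, offered as an assumption about what to compare against rather than as Statsig's exact query.

```python
def ratio_with_delta_variance(numerators, denominators):
    """Return (ratio, variance of the ratio) from per-unit totals.

    Variance uses the first-order delta method:
    Var(R) ~= (var_n - 2 * R * cov_nd + R^2 * var_d) / (n * mean_d^2)
    """
    # Keep only units with a non-zero denominator.
    pairs = [(a, b) for a, b in zip(numerators, denominators) if b != 0]
    n = len(pairs)
    mean_n = sum(a for a, _ in pairs) / n
    mean_d = sum(b for _, b in pairs) / n
    ratio = mean_n / mean_d
    var_n = sum((a - mean_n) ** 2 for a, _ in pairs) / (n - 1)
    var_d = sum((b - mean_d) ** 2 for _, b in pairs) / (n - 1)
    cov_nd = sum((a - mean_n) * (b - mean_d) for a, b in pairs) / (n - 1)
    variance = (var_n - 2 * ratio * cov_nd + ratio ** 2 * var_d) / (n * mean_d ** 2)
    return ratio, variance

# Hypothetical per-unit totals; the last unit has a zero denominator and is dropped.
ratio, variance = ratio_with_delta_variance([1, 0, 3, 5], [2, 1, 3, 0])
print(ratio, variance)
```

Treating each numerator and denominator event as an independent observation, instead of aggregating per unit as above, typically understates the variance, which is a common source of disagreement between platforms.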

Metric definitions

Users often misunderstand how Statsig calculates a given metric. Refer to the comprehensive metrics guide for details.

Summary

Following these steps clarifies where any gaps between two experiment platforms are coming from. Statsig provides the intermediate and result datasets it uses, as well as the queries used in its analysis, so you can understand where gaps arise. If you get stuck, reach out for help. For an overview of experiment pipeline patterns, refer to the Statsig Warehouse Native Documentation and Statsig Pipeline Overview.
