Reconciling Results Between Experimentation Platforms

Learn how to reconcile differences in experiment results between different analysis platforms.

Why results differ between platforms

The same data can support very different conclusions depending on the analysis methodology applied. One advantage of a modern experimentation platform is that it keeps experimental analysis consistent and transparent across your organization. This guide covers common gaps between platforms and how to identify and resolve them.

General approach

When companies evaluate an experimentation vendor, differences in results between their in-house platform and the vendor's platform are common during Proof-of-Concept (POC) validations. You can typically resolve these gaps by working through the following hypotheses in order:

  1. The metric source data is being read or joined to exposure data differently, invalidating downstream steps.
  2. Advanced statistical features available on the vendor side but not in-house are working as intended, most often reducing the influence of outliers or pre-experiment bias.
  3. There is a misunderstanding of how a metric definition works, or how an advanced configuration on a metric or experiment behaves.

By working through these in order, data teams can quickly understand and address gaps, or decide whether the vendor's approach is acceptable.

Joining data

In practice, differences in experiment results most often stem from how exposure data is joined with metric data. The Validation subsection at the end of this section describes a basic check to confirm this isn't occurring.

ID formats

In some cases, IDs are logged in different formats to different places. For example, the binary ID 4TLCtqzctSqusYcQljJLJE maps to the UUID a0fb4ef0-9d9e-11eb-9462-7bfc2b9a6ff2, so a company might have the binary ID in their production environment while their data users work with the equivalent UUIDs.

Exposures logged using the binary ID can't join with metric data using the UUID, and results are empty. Check samples for both the metric source and the assignment source or diagnostic logstream to confirm that the identifiers are in the same format.
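One way to run this check is to measure what share of exposure IDs find a match in the metric source. The following Python sketch assumes you have pulled a sample of IDs from each source into lists. The function name `id_match_rate` and the sample lists are illustrative and not part of any Statsig API.

```python
def id_match_rate(exposure_ids, metric_ids):
    """Return the share of distinct exposure IDs that also appear in the metric IDs."""
    exposure_set = set(exposure_ids)
    if not exposure_set:
        return 0.0
    matched = exposure_set & set(metric_ids)
    return len(matched) / len(exposure_set)


# Hypothetical samples: exposures use the encoded ID, metrics use the UUID form.
exposure_sample = ["4TLCtqzctSqusYcQljJLJE"]
metric_sample = ["a0fb4ef0-9d9e-11eb-9462-7bfc2b9a6ff2"]

rate = id_match_rate(exposure_sample, metric_sample)
print(f"match rate: {rate:.0%}")  # a rate near 0% suggests a format mismatch
```

A match rate well below what you expect from your traffic is a signal to inspect raw ID samples side by side before investigating anything downstream.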

ID Resolution can bridge ID type gaps, but it isn't intended for format mismatches like this one. It is designed to connect identifiers across logged-out and logged-in sessions, or other scenarios where users switch identifiers during the experiment.

Timestamps

Analyze only the metric data recorded after a user has been exposed to the experiment. Pre-exposure data carries no treatment effect, so including it dilutes the measured effect.

Statsig Cloud

Statsig Cloud uses a date-based join between exposures and metric data. For each experimental unit, an experiment includes metric data from the whole of the first exposure date onward. This can pull in some metric data recorded earlier that day, before the exposure, but the expected treatment effect on that data is zero. This is illustrated in the SQL snippet below:

```sql
WITH
metrics as (...),
exposures as (...),
joined_data as (
    SELECT
        exposures.unit_id,
        exposures.experiment_id,
        exposures.group_id,
        metrics.timestamp,
        metrics.value
    FROM exposures
    JOIN metrics
    ON (
        exposures.unit_id = metrics.unit_id
        -- Date-based join: keep metric rows on or after the first exposure date
        AND metrics.date_id >= exposures.first_date_id
    )
)
SELECT
    group_id,
    SUM(value) as value
FROM joined_data
GROUP BY group_id;
```

Statsig's exposures are always in UTC. If your metric data is in another timezone, convert it to UTC before joining. Otherwise the date comparison can include or exclude the wrong rows.
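As a minimal sketch of that adjustment, the Python snippet below converts a locally logged timestamp to its UTC calendar date. The fixed UTC-8 offset and the function name `to_utc_date` are assumptions for illustration. Substitute the timezone your metric source actually uses.

```python
from datetime import datetime, timedelta, timezone

# Assumption: metric timestamps were logged at a fixed UTC-8 offset.
SOURCE_TZ = timezone(timedelta(hours=-8))


def to_utc_date(local_timestamp: str) -> str:
    """Parse a naive local timestamp and return its UTC calendar date."""
    naive = datetime.strptime(local_timestamp, "%Y-%m-%d %H:%M:%S")
    aware = naive.replace(tzinfo=SOURCE_TZ)
    return aware.astimezone(timezone.utc).date().isoformat()


# An event at 20:30 local time on May 1 falls on May 2 in UTC.
print(to_utc_date("2024-05-01 20:30:00"))
```

Events logged late in the local day are the ones that shift to a different UTC date, so they are the rows most likely to land on the wrong side of a date-based join.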

Statsig does support timestamp-based joins for some Enterprise Cloud customers. Contact Statsig to learn more.

Statsig Warehouse Native

Statsig Warehouse Native (WHN) uses a timestamp-based join, with an option for a date-based join for daily data. This is illustrated in the SQL snippet below:

```sql
WITH
metrics as (...),
exposures as (...),
joined_data as (
    SELECT
        exposures.unit_id,
        exposures.experiment_id,
        exposures.group_id,
        metrics.timestamp,
        metrics.value
    FROM exposures
    JOIN metrics
    ON (
        exposures.unit_id = metrics.unit_id
        -- Timestamp-based join: keep metric rows at or after the first exposure
        AND metrics.timestamp >= exposures.first_timestamp
    )
)
SELECT
    group_id,
    SUM(value) as value
FROM joined_data
GROUP BY group_id;
```

Timestamps for Statsig's exposures are always in UTC. If your metric data is in another timezone, convert it to UTC before joining. Otherwise the timestamp comparison can include or exclude the wrong rows.
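To see how the two join types can produce different totals from the same rows, here is a small Python sketch on hypothetical data for a single unit. The event values and function names are invented for illustration. The logic mirrors the date-based and timestamp-based join conditions described in this section.

```python
from datetime import datetime

# Hypothetical first exposure for one unit (UTC).
first_exposure = datetime(2024, 5, 1, 12, 0, 0)

# Hypothetical metric events for the same unit: (timestamp, value).
events = [
    (datetime(2024, 4, 30, 23, 0, 0), 5.0),  # previous day: excluded by both joins
    (datetime(2024, 5, 1, 9, 0, 0), 3.0),    # same day, before exposure
    (datetime(2024, 5, 1, 15, 0, 0), 2.0),   # same day, after exposure
    (datetime(2024, 5, 2, 8, 0, 0), 4.0),    # next day
]


def date_join_total(events, first_exposure):
    """Sum values on or after the first exposure date (date-based join)."""
    return sum(v for ts, v in events if ts.date() >= first_exposure.date())


def timestamp_join_total(events, first_exposure):
    """Sum values at or after the first exposure timestamp (timestamp-based join)."""
    return sum(v for ts, v in events if ts >= first_exposure)


print(date_join_total(events, first_exposure))       # includes the 09:00 event
print(timestamp_join_total(events, first_exposure))  # excludes the 09:00 event
```

If your in-house analysis uses one join type and the vendor uses the other, a gap of this kind is expected and should be aligned before comparing results.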

Exposure duplication

De-duplicate exposure data before joining to ensure a single record per user. Many vendors also manage crossover users (users present in more than one experiment group), removing them from analysis or alerting when crossovers occur at high frequency.

```sql
SELECT
    unit_id,
    experiment_id,
    MIN(group_id) as group_id,  -- safe: HAVING guarantees a single group per unit
    MIN(timestamp) as first_timestamp
FROM <exposures_table>
GROUP BY
    unit_id,
    experiment_id
-- Drop crossover units, which appear in more than one group
HAVING COUNT(distinct group_id) = 1;
```
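The same de-duplication can be sketched in Python for spot-checking a small exposure sample outside the warehouse. The tuple layout, the function name `first_exposures`, and the sample rows are assumptions for illustration. The sketch keeps one record per unit and experiment, and drops any unit seen in more than one group.

```python
from collections import defaultdict


def first_exposures(rows):
    """Keep one record per (unit_id, experiment_id) and drop crossover units.

    rows: iterable of (unit_id, experiment_id, group_id, timestamp) tuples.
    Returns {(unit_id, experiment_id): (group_id, first_timestamp)}.
    """
    groups = defaultdict(set)
    first_seen = {}
    for unit_id, experiment_id, group_id, timestamp in rows:
        key = (unit_id, experiment_id)
        groups[key].add(group_id)
        if key not in first_seen or timestamp < first_seen[key]:
            first_seen[key] = timestamp
    return {
        key: (next(iter(seen)), first_seen[key])
        for key, seen in groups.items()
        if len(seen) == 1  # units seen in more than one group are crossovers
    }


# Hypothetical exposure log; ISO-formatted timestamps sort chronologically.
rows = [
    ("u1", "exp", "control", "2024-05-01T10:00:00"),
    ("u1", "exp", "control", "2024-05-02T10:00:00"),  # duplicate exposure
    ("u2", "exp", "control", "2024-05-01T11:00:00"),
    ("u2", "exp", "test", "2024-05-01T12:00:00"),     # crossover unit
]
print(first_exposures(rows))
```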

Data availability

When comparing a platform analysis to an existing experiment analysis that ran in the past, the underlying data may have fallen out of retention or been deleted. Compare the table's retention policy to the analysis dates used in your original experiment analysis to confirm that the data still exists. Also confirm that your experiment in the vendor console is configured to analyze the same time range as your original analysis.

Validation

To validate the initial metric data and join, use the query provided in the Timestamps section, modifying it to run on both platforms. Confirm that a target metric has the same totals per group across both platforms. Warehouse Native platforms have an advantage here because the SQL dialect and source data are generally the same in both vendor code and in-house code, making comparisons simpler. Pick one metric of interest, validate that data, and resolve any differences before checking statistical and metric methodologies.
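Once you have per-group totals from both platforms, the comparison itself can be automated. The Python sketch below assumes each side's totals have been loaded into a dictionary keyed by group. The function name `compare_group_totals`, the tolerance, and the sample numbers are illustrative.

```python
import math


def compare_group_totals(in_house, vendor, rel_tol=1e-6):
    """Return groups whose metric totals differ between the two platforms.

    in_house, vendor: {group_id: total} from the same query run on each side.
    A group missing on one side is reported with None for that side.
    """
    mismatches = {}
    for group in sorted(set(in_house) | set(vendor)):
        a, b = in_house.get(group), vendor.get(group)
        if a is None or b is None or not math.isclose(a, b, rel_tol=rel_tol):
            mismatches[group] = (a, b)
    return mismatches


# Hypothetical per-group totals for one metric.
in_house_totals = {"control": 1520.0, "test": 1604.5}
vendor_totals = {"control": 1520.0, "test": 1588.0}
print(compare_group_totals(in_house_totals, vendor_totals))
```

An empty result means the joined metric data agrees, and any remaining gap in the final results is more likely to come from statistical or metric methodology.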

Statistical features

Choices in statistical methodologies can significantly impact experiment results. The following are common root causes for gaps in results. Always closely read the queries being run by the vendor to understand any particulars in methodology.

Winsorization

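Winsorization caps extreme values at a chosen percentile instead of leaving outliers at their raw values, and it can dramatically alter experiment outcomes. Disable this feature in Statsig metrics when doing cross-system comparisons unless you're also applying the same capping in your in-house analysis.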
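If you choose to apply the capping manually on the in-house side, a minimal Python sketch might look like the following. The percentile thresholds, the rounded-index percentile rule, and the function name `winsorize` are assumptions for illustration. They are not Statsig's defaults, so match them to the thresholds configured on the metric you are comparing.

```python
def winsorize(values, lower_pct=0.0, upper_pct=0.99):
    """Clamp values to the [lower_pct, upper_pct] percentile range.

    Percentiles are taken at the rounded index pct * (n - 1) of the sorted
    sample; systems that interpolate will produce slightly different caps.
    """
    if not values:
        return []
    ordered = sorted(values)
    last = len(ordered) - 1
    low = ordered[round(lower_pct * last)]
    high = ordered[round(upper_pct * last)]
    return [min(max(v, low), high) for v in values]


# Hypothetical per-user revenue with one extreme outlier.
revenue = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 5.0, 500.0]
capped = winsorize(revenue, upper_pct=0.9)
print(sum(revenue) / len(revenue))  # mean dominated by the outlier
print(sum(capped) / len(capped))    # mean after capping
```

A single extreme unit can move the group mean by an order of magnitude, which is why a platform that caps outliers and one that does not can disagree sharply on the same data.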

CUPED

CUPED (Controlled-experiment Using Pre-Experiment Data) can significantly change variances and observed deltas, especially when pre- and post-exposure data are highly correlated or when groups differ systematically in their pre-experiment data. You can configure CUPED at the metric level, or disable it for a Pulse result set after running the analysis.
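To build intuition for why CUPED changes variances, the Python sketch below applies the textbook form of the adjustment: each unit's post-exposure value is shifted by `theta` times the deviation of its pre-exposure value from the pre-exposure mean. This is a simplified illustration under assumed data, not Statsig's implementation. The function name `cuped_adjust` and the sample values are hypothetical.

```python
from statistics import mean, pvariance


def cuped_adjust(pre, post):
    """Return CUPED-adjusted post-exposure values.

    pre, post: equal-length lists of per-unit pre- and post-exposure values.
    theta is the regression slope of post on pre, pooled across all units.
    """
    pre_mean, post_mean = mean(pre), mean(post)
    cov = sum((x - pre_mean) * (y - post_mean) for x, y in zip(pre, post))
    var = sum((x - pre_mean) ** 2 for x in pre)
    theta = cov / var if var else 0.0
    return [y - theta * (x - pre_mean) for x, y in zip(pre, post)]


# Hypothetical units whose post-exposure values track their pre-exposure values.
pre = [1.0, 2.0, 3.0, 4.0, 5.0]
post = [2.1, 3.9, 6.2, 7.8, 10.0]
adjusted = cuped_adjust(pre, post)

print(pvariance(post))      # variance before adjustment
print(pvariance(adjusted))  # much smaller variance after adjustment
```

The adjustment leaves the overall mean unchanged but removes the variance explained by pre-experiment behavior, so a platform with CUPED enabled can report tighter confidence intervals and shifted deltas compared with one without it.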

Ratio metrics

For ratio metrics using the delta method, Statsig includes only units with a non-zero denominator. Statsig calculates ratios and means as

$\bar{u} = \frac{\sum_{i=1}^{n} \text{numerator}_i}{\sum_{i=1}^{n} \text{denominator}_i}$

and uses the delta method to correct for the cluster-based nature of these metrics.
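For an in-house comparison, the ratio and its variance can be approximated with the standard first-order delta method. The Python sketch below is one such approximation under assumed data, not a copy of Statsig's queries. The function name `ratio_and_variance` and the sample values are hypothetical, and the sketch assumes at least two units remain after filtering.

```python
from statistics import mean


def ratio_and_variance(numerators, denominators):
    """Return (ratio, delta-method variance of the ratio) for one group.

    Units with a zero denominator are dropped first, as described above.
    Assumes at least two units remain after that filter.
    """
    pairs = [(x, y) for x, y in zip(numerators, denominators) if y != 0]
    n = len(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mx) * (y - my) for x, y in pairs) / (n - 1)
    ratio = mx / my  # equals sum(xs) / sum(ys)
    variance = (var_x / my**2
                - 2 * mx * cov_xy / my**3
                + mx**2 * var_y / my**4) / n
    return ratio, variance


# Hypothetical per-user clicks (numerator) and sessions (denominator).
clicks = [2, 0, 5, 1, 3]
sessions = [4, 0, 10, 2, 5]
ratio, variance = ratio_and_variance(clicks, sessions)
print(ratio, variance)
```

Two common sources of mismatch show up here: whether zero-denominator units are filtered, and whether the variance accounts for the covariance between numerator and denominator.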

Metric definitions

Users often misunderstand how a given metric is calculated. Refer to the comprehensive metrics guide for details.

Summary

Following these steps clarifies where any gaps between two experiment platforms are coming from. Statsig provides the intermediate and result datasets it uses, as well as the queries used in its analysis, making it straightforward to understand where gaps arise. If you get stuck, reach out for help. For an overview of experiment pipeline patterns, refer to the Statsig Warehouse Native Documentation and Statsig Pipeline Overview.