Switchback Tests

Learn about switchback testing methodology and how to set up switchback experiments for marketplaces and network effect scenarios.

What is switchback testing?

Switchback tests are an alternative experiment form in which an entire population switches back and forth between test and control treatments on a set cadence. In a standard A/B test, by contrast, the population is split (typically evenly) between test and control for the duration of the experiment.

Switchback tests are particularly common in marketplaces. Running a traditional A/B test on one side of a marketplace can spill over to the rest of the marketplace through network effects, which in turn biases the experiment results.

Another common use case for switchbacks is when applying different variants to different users isn't feasible for fairness, legal, or logistical reasons.

Switchback tests often run across multiple "buckets", typically regions or other defined groups that Statsig flips between test and control treatments over the course of the experiment.

Example

Consider a rideshare platform that wants to test pricing. The initial approach splits riders into two groups: one with a higher price and one with a lower price.

Riders with the lower price request rides at a significantly higher rate, consuming all available driver supply in a given area. The reduced driver supply leaves riders with higher prices facing both a higher ride estimate and longer ETAs, making them even less likely to request a ride.

The experiment results become unclear: the decreased ride request rate in the higher-price group could stem from the higher prices or the longer ETAs. The experimental design has introduced bias into the results.

A switchback test resolves this. Instead of splitting users, 100% of riders and drivers in a given metro switch in and out of the new pricing plan hourly. The test then measures the impact on overall ride request rates during hours when prices were higher versus lower.

Switchback testing on Statsig

Methodology

The switchback testing methodology for computing results consists of 3 steps:

1. Attribute events to the corresponding switchback bucket, where the time window and grouping attribute define each bucket.
2. Calculate the variant-level and bucket-level metrics based on the attributed events.
3. Calculate the difference in means between test and control. Use bootstrapping to obtain the confidence intervals.

Event attribution

Statsig attributes events to a particular bucket based on the timestamp and unit_id of the exposure, the length of the attribution window, and the timestamp of subsequent events for that unit_id.

For example: Statsig exposes user 123 to bucket A at 9:15 AM. The test has an attribution window of 90 minutes. Statsig includes all events triggered by user 123 between 9:15 AM and 10:45 AM in the metric calculations for bucket A.
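The attribution rule in this example can be sketched in a few lines of Python. This is an illustration only, not Statsig's implementation: the function name `attribute_events` is hypothetical, and treating both ends of the window as inclusive is an assumption.

```python
from datetime import datetime, timedelta

def attribute_events(exposure_time, event_times, window_minutes):
    """Return the event timestamps that fall inside the attribution window.

    Assumption: the window is inclusive at both ends.
    """
    window_end = exposure_time + timedelta(minutes=window_minutes)
    return [t for t in event_times if exposure_time <= t <= window_end]

# User 123 is exposed at 9:15 AM with a 90-minute attribution window.
exposure = datetime(2024, 1, 1, 9, 15)
events = [
    datetime(2024, 1, 1, 9, 0),    # before the exposure: not attributed
    datetime(2024, 1, 1, 9, 30),   # inside the window
    datetime(2024, 1, 1, 10, 45),  # at the window end (inclusive here)
    datetime(2024, 1, 1, 11, 0),   # after the window: not attributed
]
attributed = attribute_events(exposure, events, window_minutes=90)
```

Only the 9:30 AM and 10:45 AM events are kept for the bucket in this sketch.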

Bucket-level metrics

Once Statsig has all the events corresponding to a bucket, it calculates the scorecard metrics derived from these events.

Bucket metrics table summarizing switchback exposures

For sum and count metrics, Statsig uses the mean value per unit exposed to that bucket.

Metric detail view showing per-unit averages within a bucket
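As a rough illustration of "mean value per unit exposed", the sketch below divides the total attributed value in each bucket by the number of units exposed to it. The data layout and the name `bucket_mean_per_unit` are hypothetical and chosen only for this example.

```python
def bucket_mean_per_unit(exposed_units, attributed_events):
    """Mean metric value per exposed unit, for each bucket.

    exposed_units: dict mapping bucket id -> set of exposed unit ids.
    attributed_events: list of (bucket id, unit id, value) tuples.
    Units with no events still count in the denominator.
    """
    means = {}
    for bucket_id, units in exposed_units.items():
        total = sum(v for b, _unit, v in attributed_events if b == bucket_id)
        means[bucket_id] = total / len(units) if units else float("nan")
    return means

exposed_units = {"A": {"u1", "u2", "u3", "u4"}, "B": {"u5", "u6"}}
attributed_events = [
    ("A", "u1", 2.0),
    ("A", "u1", 1.0),
    ("A", "u3", 3.0),
    ("B", "u5", 5.0),
]
means = bucket_mean_per_unit(exposed_units, attributed_events)
```

Bucket A has a total of 6.0 over four exposed units, and bucket B has 5.0 over two.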

Variant-level metrics

Statsig calculates overall metric means for test and control by aggregating values across all buckets in that variant. If there are M buckets in the test group, the mean value of a ratio metric is:

Formula showing ratio metric mean calculation across switchback buckets

The mean of a sum or count metric is:

Formula for sum or count metric averages in switchback tests

Deltas and confidence intervals

Statsig calculates the treatment effect as:

Equation for treatment effect delta between test and control

Statsig obtains the bootstrapped confidence intervals as follows:

1. Collect a bootstrap sample with replacement from the set of test buckets and separately from the set of control buckets.
2. Calculate the difference in means between test and control samples.
3. Repeat steps 1 and 2 10,000 times to produce a distribution of metric deltas.
4. The 95% confidence interval is the range from the 2.5% quantile to the 97.5% quantile of that distribution. In general, the confidence interval with significance level $\alpha$ is:

$$\left[\, Q_{\alpha/2},\; Q_{1-\alpha/2} \,\right]$$

where $Q_p$ denotes the $p$ quantile of the bootstrapped distribution of metric deltas.
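The resampling procedure above can be sketched with the Python standard library. This is a minimal sketch under stated assumptions, not Statsig's implementation: each variant's mean is taken as the plain average of its bucket-level values (a simplification of the variant-level formulas), the quantiles use a simple order-statistic lookup, and the function name, seed, and sample data are hypothetical.

```python
import random

def bootstrap_ci(test_buckets, control_buckets, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean(test) - mean(control).

    Assumption: each list holds one metric value per bucket, and the
    variant mean is the plain average of those values.
    """
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_resamples):
        # Resample buckets with replacement, separately for each variant.
        test_sample = rng.choices(test_buckets, k=len(test_buckets))
        control_sample = rng.choices(control_buckets, k=len(control_buckets))
        test_mean = sum(test_sample) / len(test_sample)
        control_mean = sum(control_sample) / len(control_sample)
        deltas.append(test_mean - control_mean)
    deltas.sort()
    # Order-statistic approximation of the alpha/2 and 1 - alpha/2 quantiles.
    lower = deltas[int((alpha / 2) * (n_resamples - 1))]
    upper = deltas[int((1 - alpha / 2) * (n_resamples - 1))]
    return lower, upper

test_values = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2]     # made-up bucket-level values
control_values = [4.0, 4.2, 3.9, 4.1, 4.3, 3.8]  # made-up bucket-level values
lower, upper = bootstrap_ci(test_values, control_values)
```

Because the resampling unit is the bucket rather than the user, the width of the interval depends on the number of buckets in each variant.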

Setup

To set up a switchback test on Statsig, create an experiment, then open Advanced Settings → Experiment Type and select "Switchback Test".

Experiment type menu selecting switchback test

Switchback test configuration adds two new aspects to the standard experiment setup:

Targeting: The defined population(s) you run your experiment on.
Schedule: The switching frequency and starting treatments for different pre-defined populations.

There are two ways to define targeting:

Targeting Gate: Specify a targeting gate to define your target experiment population, the same as any other experiment on Statsig.
Bucketing Method: Bucket users either into pre-defined buckets or by randomizing across an ID type.

Switchback targeting configuration showing gate and bucketing options

Buckets let you specify pre-defined buckets, such as Country, Locale, or a Custom Field you log. Use this option when you have a few pre-defined populations you want to switch in and out of Test/Control over the course of the experiment.

Buckets configuration table listing predefined regions

ID Type lets you specify an ID type to randomize across. For example, choosing a custom ID such as CityID automatically randomizes different CityIDs across Treatment/Control over the different switchback windows. Use this option when you have a very large or dynamic number of experiment units to randomize across.

Randomized bucketing is an advanced feature. Contact the support team, your sales contact, or the Slack community so they can enable it.

Randomized ID bucketing interface selecting custom ID

Depending on which bucketing method you select, the Schedule section of experiment setup lets you configure:

Start time
Duration (in days)
Assignment window size (in minutes)
Burn-in/burn-out periods (in minutes)
(Pre-defined bucketing only) Starting phase (treatment group) for each bucket

Switchback schedule editor specifying assignment windows and bucket starting phases

Burn-in/burn-out periods define time intervals at the start and end of each switchback window during which Statsig discards exposures from analysis. Use them when the previous treatment risks bleeding over while a population switches between test and control.
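To make the schedule settings concrete, the sketch below computes which treatment a pre-defined bucket would be in at a given time and whether an exposure falls outside the burn-in/burn-out periods. It assumes strict alternation between test and control every assignment window, beginning with the bucket's starting phase. The function names and the alternation rule are assumptions for illustration, not a description of how Statsig schedules windows internally.

```python
from datetime import datetime, timedelta

def treatment_at(start, ts, window_minutes, starting_phase="test"):
    """Phase active at time ts, assuming phases alternate every window."""
    window = timedelta(minutes=window_minutes)
    index = (ts - start) // window  # number of full windows elapsed
    other = "control" if starting_phase == "test" else "test"
    return starting_phase if index % 2 == 0 else other

def in_analysis(start, ts, window_minutes, burn_in_minutes, burn_out_minutes):
    """True if ts is outside the burn-in and burn-out periods of its window."""
    window = timedelta(minutes=window_minutes)
    offset = (ts - start) % window  # time elapsed within the current window
    burn_in = timedelta(minutes=burn_in_minutes)
    burn_out = timedelta(minutes=burn_out_minutes)
    return burn_in <= offset < window - burn_out

start = datetime(2024, 1, 1, 0, 0)
mid_first_window = datetime(2024, 1, 1, 0, 30)
early_second_window = datetime(2024, 1, 1, 1, 2)

phase_a = treatment_at(start, mid_first_window, 60)        # first window
phase_b = treatment_at(start, early_second_window, 60)     # second window
kept_a = in_analysis(start, mid_first_window, 60, 5, 5)     # 30 min into window
kept_b = in_analysis(start, early_second_window, 60, 5, 5)  # 2 min into window
```

With 60-minute windows and 5-minute burn-in/burn-out periods, the exposure at 00:30 is kept and the one at 01:02 is discarded because it lands in the burn-in period of the second window.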

Reading results

Diagnostics and Pulse metric lift results for switchback tests resemble Statsig's traditional A/B tests, with a few differences:

No hourly Pulse: Because a switchback experiment starts with all-Test or all-Control exposures, Statsig disables hourly Pulse until there is a meaningful amount of data. Use the Diagnostics tab in the meantime to verify checks are arriving and bucketing as expected.
No time-series: Switchback tests don't support the Daily or Days Since First Exposure time-series. The bootstrapping methodology requires pooling all available days together to achieve sufficient statistical power.
No dimension breakdown: Switchback tests don't support breaking down a metric by user property or event property.
No advanced statistical techniques: Switchback tests don't yet support CUPED or Sequential Testing.

Switchback experiment pulse results showing bucket-level metrics
