On this page

Switchback Tests

Learn about switchback testing methodology and how to set up switchback experiments for marketplaces and network effect scenarios.

What is switchback testing?

Switchback tests are an alternative experiment form in which an entire population switches back and forth between test and control treatments on a set cadence, rather than being split and evenly divided between test and control for the duration of the experiment.

Switchback tests are particularly common in marketplaces. Running a traditional A/B test on one side of the marketplace can have unintended consequences on the rest of the marketplace due to network effects, which affect experiment results.

Another common use case for switchbacks is when applying different variants to different users isn't feasible for fairness, legal, or logistical reasons.

Switchback tests are often carried out across multiple "buckets", typically regions or other defined groups that are flipped between test and control treatments over the course of the experiment.

Example

Consider a rideshare platform that wants to test pricing. The initial approach splits riders into two groups: one with a higher price and one with a lower price.

Riders with the lower price request rides at a significantly higher rate, consuming all available driver supply in a given area. This leaves riders with higher prices facing not only a higher ride estimate but also longer ETAs, making them even less likely to request a ride.

The experiment results become unclear: the decreased ride request rate in the higher-price group could be caused by the higher prices or by the longer ETAs. The experimental design has introduced bias into the results.

A switchback test resolves this. Instead of splitting users, 100% of riders and drivers in a given metro switch in and out of the new pricing plan hourly. The test then measures the impact on overall ride request rates during hours when prices were higher versus lower.

Switchback testing on Statsig

Methodology

The switchback testing methodology for computing results consists of 3 steps:

  1. Attribute events to the corresponding switchback bucket, where each bucket is defined by the time window and grouping attribute.
  2. Calculate the variant-level and bucket-level metrics based on the attributed events.
  3. Calculate the difference in means between test and control. Use bootstrapping to obtain the confidence intervals.

Event attribution

Attribution of events to a particular bucket is based on the timestamp and unit_id of the exposure, the length of the attribution window, and the timestamp of subsequent events for that unit_id.

For example: User 123 is exposed to bucket A at 9:15 AM. The test has an attribution window of 90 minutes. All events triggered by user 123 between 9:15 AM and 10:45 AM are included in the metric calculations for bucket A.

Bucket-level metrics

Once Statsig has all the events corresponding to a bucket, it calculates the scorecard metrics derived from these events.

Bucket metrics table summarizing switchback exposures

For sum and count metrics, Statsig uses the mean value per unit exposed to that bucket.

Metric detail view showing per-unit averages within a bucket

Variant-level metrics

Statsig calculates overall metric means for test and control by aggregating values across all buckets in that variant. If there are M buckets in the test group, the mean value of a ratio metric is given by:

Formula showing ratio metric mean calculation across switchback buckets

The mean of a sum or count metric would be:

Formula for sum or count metric averages in switchback tests

Deltas and confidence intervals

The treatment effect is calculated as:

Equation for treatment effect delta between test and control

The bootstrapped confidence intervals are obtained as follows:

  1. Collect a bootstrap sample with replacement from the set of test buckets and separately from the set of control buckets.
  2. Calculate the difference in means between test and control samples.
  3. Repeat steps one and two 10,000 times to produce a distribution of metric deltas.
  4. The 95% confidence interval is the range from the 2.5% quantile to the 97.5% quantile of that distribution. In general, the confidence interval with significance level $\alpha$ is given by:

Bootstrap confidence interval formula for switchback tests

Setup

To set up a switchback test on Statsig, when you create an experiment tap Advanced SettingsExperiment Type and select "Switchback Test".

Experiment type menu selecting switchback test

Switchback test configuration adds two new aspects to the standard experiment setup:

  1. Targeting: The defined population(s) you run your experiment on.
  2. Schedule: The switching frequency and starting treatments for different pre-defined populations.

There are two ways to define targeting:

  • Targeting Gate: Specify a targeting gate to define your target experiment population, the same as any other experiment on Statsig.
  • Bucketing Method: Bucket users based on either pre-defined buckets or randomized across an ID type.

Switchback targeting configuration showing gate and bucketing options

Buckets let you specify pre-defined buckets, such as Country, Locale, or a Custom Field you log. Use this option when you have a few pre-defined populations you want to switch in and out of Test/Control over the course of the experiment.

Buckets configuration table listing predefined regions

ID Type lets you specify an ID type to randomize across. For example, choosing a custom ID such as CityID automatically randomizes different CityIDs across Treatment/Control over the different switchback windows. Use this option when you have a very large or dynamic number of experiment units to randomize across.

Randomized bucketing is an advanced feature. Reach out to the support team, your sales contact, or the Slack community to have this enabled.

Randomized ID bucketing interface selecting custom ID

Depending on which bucketing method is selected, the Schedule section of experiment setup lets you configure:

  • Start time
  • Duration (in days)
  • Assignment window size (in minutes)
  • Burn-in/ burn-out periods (in minutes)
  • (Pre-defined bucketing only) Starting phase (treatment group) for each bucket

Switchback schedule editor specifying assignment windows and bucket starting phases

Burn-in/burn-out periods let you define time intervals at the start and end of your switchback windows to discard exposures from analysis. Use these when there are risks of bleed-over effects from the previous treatment while a population is switching between test and control.

Reading results

Diagnostics and Pulse metric lift results for switchback tests resemble Statsig's traditional A/B tests, with a few differences:

  • No hourly Pulse: Because a switchback experiment starts with all-Test or all-Control exposures, hourly Pulse is disabled until there is a meaningful amount of data. Use the Diagnostics tab in the meantime to verify checks are arriving and bucketing as expected.
  • No time-series: The Daily and Days Since First Exposure time-series aren't available for switchback tests. The bootstrapping methodology requires pooling all available days together to achieve sufficient statistical power.
  • No dimension breakdown: Breaking down a metric by user property or event property isn't available for switchback tests.
  • Advanced statistical techniques: CUPED and Sequential Testing aren't yet available for switchback tests.

Switchback experiment pulse results showing bucket-level metrics

Was this helpful?