Methodology

How multi-armed bandits work in Statsig Autotune to automatically allocate traffic to the best-performing variant based on a single goal metric.

Model

The base Autotune implementation uses a Thompson Sampling (Bayesian) algorithm to estimate each variant's probability of being the best variant and allocate a proportional amount of traffic.

For example, if a given variant has a 60% probability of being the best, Autotune allocates 60% of the traffic to it. As evidence accumulates that a treatment is better at maximizing the reward (the target metric), the multi-armed bandit algorithm shifts more users to that treatment.
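As a rough illustration of the allocation rule described above, the sketch below estimates each variant's probability of being the best and uses those probabilities as traffic shares. It assumes a binary goal metric (converted or not), a uniform Beta(1, 1) prior per variant, and a Monte Carlo estimate. The function name `probability_best` and the counts are hypothetical, and none of this is confirmed to match Autotune's internal implementation.

```python
import random


def probability_best(successes, failures, draws=20_000, seed=0):
    """Estimate each variant's probability of being the best.

    Assumption: binary metric with a Beta(1, 1) prior per variant.
    These are illustrative choices, not confirmed Autotune internals.
    """
    rng = random.Random(seed)
    wins = [0] * len(successes)
    for _ in range(draws):
        # Draw one plausible conversion rate per variant from its posterior.
        samples = [
            rng.betavariate(1 + s, 1 + f)
            for s, f in zip(successes, failures)
        ]
        # Credit the variant whose sampled rate is highest in this draw.
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]


# Hypothetical counts for three variants: conversions and non-conversions.
successes = [30, 45, 38]
failures = [270, 255, 262]

# Each variant's traffic share equals its estimated probability of being best.
allocation = probability_best(successes, failures)
print([round(p, 3) for p in allocation])
```

With these counts the second variant has the highest observed conversion rate, so it receives the largest share, while the other variants keep a smaller share that reflects the remaining uncertainty.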

Throughout the process, higher-performing treatments receive more traffic and underperforming treatments receive less. When the winning treatment beats the second-best treatment by a specified margin, the process ends.
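One possible reading of this stopping rule can be sketched as follows. The text does not say what quantity the margin is measured on, so this example assumes it is the gap in probability of being the best between the leader and the runner-up. The function name `has_winner` and the margin value are hypothetical.

```python
def has_winner(prob_best, margin):
    """Return the leading variant's index if it clears the margin, else None.

    Assumption: the margin is applied to the gap in probability of being
    best between the top two variants. The actual Autotune criterion may differ.
    """
    ranked = sorted(range(len(prob_best)), key=lambda i: prob_best[i], reverse=True)
    leader, runner_up = ranked[0], ranked[1]
    if prob_best[leader] - prob_best[runner_up] >= margin:
        return leader
    return None


# Hypothetical probabilities of being best for three variants.
decided = has_winner([0.05, 0.90, 0.05], margin=0.8)    # clear leader
undecided = has_winner([0.30, 0.45, 0.25], margin=0.8)  # still too close
print(decided, undecided)
```

In the first case the leader is far enough ahead for the process to end. In the second, the bandit would keep allocating traffic and collecting data.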

Some helpful references:

Statsig Blog
Goyal and Agrawal (Microsoft Research) Regret Analysis
Doordash Engineering Summary Blog

Advantages

The main advantage of the base Multi-Armed Bandit over a contextual bandit is its ability to converge and identify the best variant. When a single solution works well for all users, the Multi-Armed Bandit efficiently allocates traffic and determines the correct long-term solution while minimizing regret. Regret is the cost of exposing many users to a worse variant, as happens in an A/B test.
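To make regret concrete, the sketch below compares an even traffic split with a split skewed toward the best variant. The true conversion rates and user counts are invented for illustration, and `expected_regret` is a hypothetical helper. In practice the true rates are unknown, which is why the bandit has to estimate them as it goes.

```python
def expected_regret(true_rates, users_per_variant):
    """Expected conversions lost by not sending every user to the best variant."""
    best = max(true_rates)
    return sum(n * (best - r) for r, n in zip(true_rates, users_per_variant))


# Hypothetical true conversion rates for three variants.
true_rates = [0.10, 0.15, 0.12]

even_split = [1000, 1000, 1000]   # fixed allocation, as in an A/B/n test
bandit_split = [300, 2400, 300]   # allocation skewed toward the best variant

regret_even = expected_regret(true_rates, even_split)
regret_bandit = expected_regret(true_rates, bandit_split)
print(round(regret_even, 2), round(regret_bandit, 2))
```

For the same 3,000 users, the even split gives up roughly 80 expected conversions while the skewed split gives up roughly 24.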

Disadvantages

The main disadvantage of a Multi-Armed Bandit compared to a contextual bandit is its inability to personalize. When user attributes interact with variants, Autotune can identify a global maximum that is worse than serving each user their individual best variant.

For example, even if the "US Flag" variant had the highest overall CTR, it would be a poor choice for Canadian users. In such cases, Autotune converges on a single variant for both groups, which is sub-optimal for at least one of them.
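The gap between a global winner and per-user winners can be shown with a small calculation. All numbers below are invented, and the "Canada Flag" variant is a hypothetical counterpart to the "US Flag" variant added only for illustration.

```python
# Hypothetical click-through rates by user country and variant.
ctr = {
    "US": {"US Flag": 0.12, "Canada Flag": 0.04},
    "CA": {"US Flag": 0.03, "Canada Flag": 0.11},
}
# Hypothetical audience sizes.
users = {"US": 9000, "CA": 1000}

variants = ["US Flag", "Canada Flag"]
total_users = sum(users.values())


def overall_ctr(variant):
    """CTR across all users if everyone is served the same variant."""
    return sum(users[c] * ctr[c][variant] for c in users) / total_users


# What a non-contextual bandit converges to: one variant for everyone.
global_best = max(variants, key=overall_ctr)

# What personalization could reach: each country gets its own best variant.
personalized_ctr = sum(users[c] * max(ctr[c].values()) for c in users) / total_users

print(global_best, round(overall_ctr(global_best), 3), round(personalized_ctr, 3))
```

Here the global winner is "US Flag" because US users dominate the traffic, yet Canadian users click far less on it than on their own best variant, so the personalized CTR is higher than the global winner's.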

| | A/B/n Test | Multi-Armed Bandit (Autotune) | Contextual Bandit (Autotune AI) | Ranking Engine |
|---|---|---|---|---|
| Typical Number of Variants | 2-3 | 4-8 | 4-8 | Arbitrary |
| Personalization Factor | None | None | Moderate | High |
| Input Data Required | None | Very little (100+ samples) | Little (generally 1,000+ samples) | Tens of thousands to millions of samples |
| Model Efficacy | None | Basic | Moderate | High |
| Identifies Best Variant | Yes | Yes | No | No |
| Consistent User Assignment | Yes | No | No | No |
