What is a Multi-Armed Bandit?

What it is

A working definition of the multi-armed bandit.

The multi-armed bandit problem takes its name from a scenario in which a gambler chooses which of several slot machines (one-armed bandits) to play in order to maximize total winnings, without knowing in advance which machine has the highest payout rate. Each pull generates a reward that provides information about that machine’s true payout rate, but pulling one machine forgoes the opportunity to learn about or earn from the others. The fundamental tension is between exploration (trying less-tested options to learn their true performance) and exploitation (choosing the currently best-known option to maximize immediate reward). A good strategy balances both throughout the decision process rather than committing entirely to either.

Several algorithms address this tradeoff with different strategies. Epsilon-greedy selects the current best-performing option with probability 1 minus epsilon and a random option with probability epsilon, where epsilon is a small number such as 0.1. Upper Confidence Bound methods select the option with the highest upper confidence bound on its estimated performance, favoring options that have been tried few times and therefore have wide uncertainty intervals. Thompson Sampling maintains a probability distribution over each option’s true performance and selects options by sampling from these distributions, naturally allocating more exploration to uncertain options and more exploitation to options with consistently high estimated performance.
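The three selection rules described above can be sketched in a few lines each. This is a minimal illustration rather than a production implementation: it assumes binary rewards (such as click or no click), tracks only per-option success and trial counts, uses the UCB1 variant of Upper Confidence Bound, and uses a uniform Beta(1, 1) prior for Thompson Sampling. The function names and parameters are illustrative.

```python
import math
import random


def epsilon_greedy(successes, pulls, epsilon=0.1, rng=random):
    """With probability epsilon pick a random option, else the best observed mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(pulls))
    means = [s / n if n else 0.0 for s, n in zip(successes, pulls)]
    return max(range(len(pulls)), key=means.__getitem__)


def ucb1(successes, pulls):
    """Pick the option with the highest mean plus exploration bonus (UCB1)."""
    # Try every option once first; the bonus is undefined for zero pulls.
    for option, n in enumerate(pulls):
        if n == 0:
            return option
    total = sum(pulls)
    scores = [
        s / n + math.sqrt(2 * math.log(total) / n)  # bonus shrinks as n grows
        for s, n in zip(successes, pulls)
    ]
    return max(range(len(pulls)), key=scores.__getitem__)


def thompson(successes, pulls, rng=random):
    """Draw one sample per option from its Beta posterior; pick the largest draw."""
    draws = [
        rng.betavariate(1 + s, 1 + n - s)  # Beta(1, 1) prior is an assumption
        for s, n in zip(successes, pulls)
    ]
    return max(range(len(pulls)), key=draws.__getitem__)
```

With hypothetical counts of successes = [12, 30, 2] over pulls = [100, 120, 10], the greedy rule (epsilon set to 0) picks the second option because it has the best observed rate, while UCB1 picks the third because ten pulls leave its estimate highly uncertain. Thompson Sampling picks either with a probability that reflects how plausible it is that each one is truly best.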

Contextual bandits extend the basic framework by incorporating features of the current context into the selection decision. Rather than asking “which option performs best overall,” a contextual bandit asks “which option performs best given the current user, page, time, and other contextual features.” This makes contextual bandits the natural framework for real-time personalization, where the goal is to select the best content or offer for each individual given their specific context, learning which features predict which options work best for which users as data accumulates.
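A deliberately simplified sketch of the contextual idea follows. It assumes the context has already been reduced to a small discrete label (for example a device type) and keeps a separate Beta posterior for each context and option pair. Real contextual bandits usually fit a model that generalizes across features, such as LinUCB; the tabular version here, including the class name ContextualThompson, is only an illustration.

```python
import random
from collections import defaultdict


class ContextualThompson:
    """Thompson Sampling with one Beta posterior per (context, option) pair."""

    def __init__(self, options, seed=None):
        self.options = list(options)
        self.rng = random.Random(seed)
        # Maps (context, option) to [successes, failures]; unseen pairs start at 0.
        self.counts = defaultdict(lambda: [0, 0])

    def select(self, context):
        """Return the option with the highest posterior draw for this context."""
        def draw(option):
            wins, losses = self.counts[(context, option)]
            return self.rng.betavariate(1 + wins, 1 + losses)
        return max(self.options, key=draw)

    def update(self, context, option, reward):
        """Record a binary reward (truthy = success) for the chosen option."""
        self.counts[(context, option)][0 if reward else 1] += 1


if __name__ == "__main__":
    # Hypothetical response rates: video works on mobile, articles on desktop.
    true_rates = {
        ("mobile", "video"): 0.30, ("mobile", "article"): 0.05,
        ("desktop", "video"): 0.05, ("desktop", "article"): 0.30,
    }
    bandit = ContextualThompson(["video", "article"], seed=1)
    visitors = random.Random(2)
    for t in range(4000):
        context = "mobile" if t % 2 == 0 else "desktop"
        option = bandit.select(context)
        bandit.update(context, option, visitors.random() < true_rates[(context, option)])
    print({key: tuple(value) for key, value in bandit.counts.items()})
```

The context must be hashable (a string or tuple). The tabular approach stops scaling once contexts become numerous or continuous, which is the point at which a feature-based model becomes necessary.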

Why ad agencies care

Why bandit-based optimization is more efficient than fixed A/B testing for ongoing creative and content decisions.

An ad agency that runs creative optimization for clients with traditional A/B testing splits traffic equally between variants for a fixed period, then analyzes the results and shifts all traffic to the winner. This approach sacrifices performance during the test period, when half the traffic goes to potentially underperforming variants, and it delays exploitation of the winner until the test concludes. Multi-armed bandit approaches instead shift traffic allocation toward better-performing variants continuously as evidence accumulates. This reduces regret, the reward lost by not always serving the best variant, and produces better average performance across the full testing period.
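The difference between the two approaches can be explored with a small simulation. The sketch below assumes binary rewards and stationary conversion rates, and every number in it (the rates, the test length, and the horizon) is hypothetical. It compares total conversions from a fixed equal split followed by a full rollout against Thompson Sampling over the same number of impressions.

```python
import random


def fixed_split(rates, test_rounds, total_rounds, rng):
    """Equal split for test_rounds, then send everything to the observed winner."""
    k = len(rates)
    wins, pulls, reward = [0] * k, [0] * k, 0
    for t in range(test_rounds):  # assumes test_rounds >= number of variants
        arm = t % k  # round-robin gives an equal split
        hit = rng.random() < rates[arm]
        wins[arm] += hit
        pulls[arm] += 1
        reward += hit
    winner = max(range(k), key=lambda a: wins[a] / pulls[a])
    for _ in range(total_rounds - test_rounds):
        reward += rng.random() < rates[winner]
    return reward


def thompson_allocation(rates, total_rounds, rng):
    """Reallocate on every impression by sampling from Beta posteriors."""
    k = len(rates)
    wins, losses, reward = [0] * k, [0] * k, 0
    for _ in range(total_rounds):
        arm = max(range(k), key=lambda a: rng.betavariate(1 + wins[a], 1 + losses[a]))
        hit = rng.random() < rates[arm]
        wins[arm] += hit
        losses[arm] += not hit
        reward += hit
    return reward


if __name__ == "__main__":
    rates = [0.05, 0.10]  # hypothetical true conversion rates
    runs = 20
    fixed = sum(fixed_split(rates, 4000, 8000, random.Random(i)) for i in range(runs)) / runs
    adaptive = sum(thompson_allocation(rates, 8000, random.Random(100 + i)) for i in range(runs)) / runs
    print(f"fixed split: {fixed:.0f} conversions, Thompson Sampling: {adaptive:.0f}")
```

In this setup the fixed split spends half of its test traffic on the weaker variant by design, whereas the bandit stops doing so once the posteriors separate. The size of the gap depends heavily on the assumed rates and on how long the fixed test runs relative to the horizon.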

Creative variant selection in high-volume campaigns benefits most from bandit-based allocation. For campaigns running tens of thousands of impressions per day, the opportunity cost of sending substantial traffic to underperforming creative variants during a fixed A/B test is significant. For a campaign with 20,000 daily impressions, a Thompson Sampling bandit that continuously rebalances traffic based on observed conversion rates can shift 90% of traffic to the best-performing variant within 3 to 5 days, compared with the 2 to 3 weeks a fixed A/B test can need to reach statistical significance at the same traffic volume. The exact timing in both cases depends on the baseline conversion rate and on the size of the gap between variants. At scale, the difference in cumulative conversions between the two approaches can be substantial.
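To see where a multi-week figure for a fixed test can come from, the standard sample-size approximation for comparing two proportions is sketched below. The 20,000 daily impressions come from the text above; the conversion rates (1.0% versus 1.1%), the 5% two-sided significance level, and the 80% power are assumptions chosen only for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(p_control, p_variant, alpha=0.05, power=0.8):
    """Approximate impressions per variant for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_variant) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p_control * (1 - p_control) + p_variant * (1 - p_variant))
    ) ** 2
    return math.ceil(numerator / (p_control - p_variant) ** 2)


def days_to_finish(p_control, p_variant, daily_impressions, variants=2, **kwargs):
    """Days of traffic needed when impressions are split equally across variants."""
    per_variant = sample_size_per_variant(p_control, p_variant, **kwargs)
    return math.ceil(per_variant * variants / daily_impressions)


if __name__ == "__main__":
    # Hypothetical rates: 1.0% control versus 1.1% variant.
    print(sample_size_per_variant(0.010, 0.011))
    print(days_to_finish(0.010, 0.011, daily_impressions=20_000))
```

Under these assumptions the test needs on the order of 160,000 impressions per variant, which is a little over two weeks at 20,000 impressions per day. A larger gap between variants shortens the test sharply, because the required sample size scales with the inverse square of the difference.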

Contextual bandits personalize content selection without requiring explicit user segmentation. A content recommendation system that selects which article, video, or product to feature for each site visitor is a contextual bandit problem: the context is the visitor’s behavioral signals and the actions are the available content options. Unlike segmentation-based personalization that first assigns users to predefined segments and then serves segment-appropriate content, contextual bandits learn the direct relationship between contextual features and content performance without requiring the intermediate segmentation step. This enables fine-grained personalization that responds to the full richness of available contextual signals rather than the coarse category assignments used in rule-based personalization.

Budget allocation across ad channels can also be framed as a multi-armed bandit problem, which allows dynamic reallocation. A campaign flight with a daily budget to spread across multiple channels faces a bandit-like tradeoff: concentrating budget in the currently best-performing channel maximizes immediate performance but forgoes learning about channels whose current performance is uncertain because recent data is limited. Bandit-based approaches maintain an uncertainty estimate for each channel’s current ROI and allocate budget according to those estimates, for example in proportion to each channel’s probability of being the current best performer. They can outperform static allocation schedules when channel performance fluctuates with competitive pressure, seasonal changes in audience composition, and platform algorithm updates.
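One way to make this concrete is a Thompson-style rule that splits the daily budget in proportion to each channel’s estimated probability of being the best performer. The sketch below assumes each channel’s ROI belief can be summarized as a normal distribution with a mean and a standard error. The channel names, ROI figures, and the allocate_budget helper are all hypothetical.

```python
import random


def allocate_budget(channels, daily_budget, draws=10_000, seed=0):
    """Split budget by each channel's Monte Carlo probability of having the top ROI.

    channels maps a channel name to (roi_mean, roi_std_error).
    """
    rng = random.Random(seed)
    names = list(channels)
    best_counts = dict.fromkeys(names, 0)
    for _ in range(draws):
        # Sample one plausible ROI per channel and record which channel wins.
        sampled = {name: rng.gauss(*channels[name]) for name in names}
        best_counts[max(sampled, key=sampled.get)] += 1
    return {name: daily_budget * best_counts[name] / draws for name in names}


if __name__ == "__main__":
    beliefs = {
        "search": (3.0, 0.2),  # strong and well measured
        "social": (2.5, 0.2),  # weaker and well measured
        "video": (2.5, 1.0),   # same mean as social, but little recent data
    }
    for name, budget in allocate_budget(beliefs, daily_budget=10_000).items():
        print(f"{name}: {budget:,.0f}")
```

In this example the video channel receives noticeably more budget than social even though both have the same estimated ROI, because its wide uncertainty leaves a real chance that it is the best channel. To handle the fluctuating performance mentioned above, the beliefs would need to be re-estimated from recent data only, which this sketch does not attempt.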

In practice

What multi-armed bandit looks like inside a working ad agency.

An agency is running an email subject line optimization program for an e-commerce client with 220,000 active subscribers and a weekly promotional newsletter. Historically, the client tested two subject line variants on 10% of the list over two weeks before rolling out the winner to the remaining 90%. The agency proposes replacing this with a Thompson Sampling bandit that continuously reallocates sends across four subject line variants for each weekly newsletter.

In week one, each of the four variants goes to an equal 25% share of a 20,000-subscriber exploration pool. Open rates from the first day of sends update the Thompson Sampling posterior distribution for each variant. By day 3, the algorithm has identified a clear advantage for variant B, with a posterior mean open rate of 28.4% versus 19.2%, 22.1%, and 21.8% for the other three variants. For the remainder of the week’s sends, allocation shifts to 71% for variant B, with the remaining 29% split roughly evenly across the other three variants. The following week, a new set of four variants enters the exploration pool and the process repeats.

Compared with the prior two-week fixed A/B testing approach, the bandit approach produces 4,200 incremental opens per month from better-allocated sends during the exploration period and delivers winning variants 6 days faster on average. The client’s marketing operations team approves the approach for all future newsletters after seeing the first-quarter results.

Build the adaptive testing and optimization expertise that improves creative performance through The Creative Cadence Workshop.
