Multi-Armed Bandit (MAB) Methods in Trading

Written By
Dan Buckley
Dan Buckley is a US-based trader, consultant, and part-time writer with a background in macroeconomics and mathematical finance. He trades and writes about a variety of asset classes, including equities, fixed income, commodities, currencies, and interest rates. As a writer, his goal is to explain trading and finance concepts at levels of detail that appeal to a range of audiences, from novice traders to those with more experienced backgrounds.

Imagine standing in a casino with rows of slot machines, each promising different payouts. Your goal? 

Maximize your winnings without knowing which machine is “hot.” 

This captures the essence of Multi-Armed Bandit (MAB) problems – a framework for balancing exploration (testing uncertain options) and exploitation (leveraging known winners). 

In trading, MAB methods are a capital and risk allocation framework – i.e., dynamically allocating capital to strategies, assets, or orders to optimize returns (or the specific goals of the strategy/portfolio). 

But how does this mathematical concept translate to financial markets? 

 


Key Takeaways – Multi-Armed Bandit (MAB) Methods in Trading

  • Balance Exploration & Exploitation – MAB models dynamically allocate capital between testing new opportunities and maximizing profits from proven strategies.
  • Adaptive Strategy Selection – Traders can use MAB to shift capital toward high-performing assets or strategies while maintaining a buffer for testing alternatives (that provide diversification or even higher returns than established strategies).
  • Algorithmic Optimization – MAB improves trading automation by learning from market shifts.
  • Risk-Aware Decision Making – When combined with risk metrics, MAB prevents overexposure to volatile assets and ensures more stable returns.
  • Beyond Traditional Models – MAB can be combined with reinforcement learning and contextual bandits. This helps optimize order execution, portfolio management, and strategy development.

 

Multi-Armed Bandit Background

As the opening scene suggests, the Multi-Armed Bandit name comes from the analogy of a gambler facing multiple slot machines (“one-armed bandits”), each with unknown payouts, who must decide how to allocate their bets to maximize winnings while learning which machines are most profitable.

The MAB problem was first formalized in the 1950s in the field of sequential decision-making and probability theory, particularly in clinical trials and allocation problems.

It’s since evolved into a key concept in reinforcement learning, finance, and online optimization, helping solve real-world challenges like:

  • dynamic pricing
  • A/B testing, and
  • portfolio allocation…

…by balancing exploration (testing unknown options) and exploitation (maximizing known rewards).

 

The Exploration-Exploitation Dilemma

The essence of MAB lies in the exploration-exploitation trade-off. Should a trader stick with a proven strategy (exploitation) or test new approaches (exploration)? 

Too much exploitation risks missing better opportunities. Too much exploration burns capital/time/resources on unproven bets.

In trading, this balance is important. Markets evolve over time – yesterday’s winning asset or strategy may stagnate tomorrow. 

MAB algorithms quantify this uncertainty, using statistical methods to allocate resources adaptively. 

For example, during a market rally, a MAB model might exploit momentum strategies but still allocate a fraction of funds to explore defensive assets in case of a reversal.

 

Key MAB Algorithms and Their Mechanics

Epsilon-Greedy: Simplicity

The epsilon-greedy algorithm is the “training wheels” of MAB methods. It splits decisions into two modes:

  • Exploitation (1-ε) – Allocate most resources (e.g., 80%) to the historically best-performing option.
  • Exploration (ε) – Dedicate a small fraction (e.g., 20%) to randomly testing alternatives.

In trading, this could mean directing 80% of capital to steadier, known strategies while using 20% to experiment with riskier, higher-yield strategies. 
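A minimal epsilon-greedy sketch of that 80/20 split. The three strategy averages are invented for illustration, not real performance data:

```python
import random

def epsilon_greedy_pick(avg_rewards, epsilon=0.2, rng=random):
    """With probability epsilon, explore a random arm; otherwise
    exploit the arm with the best running average reward."""
    if rng.random() < epsilon:
        return rng.randrange(len(avg_rewards))                         # explore
    return max(range(len(avg_rewards)), key=avg_rewards.__getitem__)   # exploit

# Hypothetical running average returns for three strategies
avg_rewards = [0.04, 0.07, 0.02]
counts = [0, 0, 0]
random.seed(1)
for _ in range(1000):
    counts[epsilon_greedy_pick(avg_rewards)] += 1
# Roughly 80% of picks land on strategy 1 (the best average);
# the rest are spread across all three arms by exploration.
```

In practice the averages would be updated after every trade, but even this static version shows how the split concentrates capital without ever fully abandoning the alternatives.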

A fixed exploration rate is a blunt instrument, though: it can miss short-lived opportunities, and the 20% exploratory buffer may be too small – or may simply perform poorly at the same time as the core allocation.

How much to exploit and how much to explore?

Balancing exploration and exploitation is important in both trading and business strategy.

It requires a nuanced evaluation of several factors. 

Exploitation (e.g., allocating 80% of resources to proven strategies) hinges on historical performance, stability, and risk predictability. 

In trading, this might involve sticking with assets or algorithms that have consistently yielded returns, assuming market conditions remain stable.

For businesses, exploitation leverages established products, loyal customer bases, and streamlined operations to maximize short-term profits.

But overreliance risks stagnation if markets shift or competitors innovate.

Exploration (e.g., dedicating 20% to testing new options) demands tolerance for uncertainty and investment in potential breakthroughs. 

Traders might experiment with emerging markets, untested algorithms (in live markets), or alternative data sources. 

Businesses might invest in R&D, pilot new markets, or adopt disruptive technologies.

Google, for example, maintains a venture capital arm, as do many other companies seeking to stay relevant.

Exploration is resource-intensive but vital for long-term resilience, as seen in industries like tech, where innovation cycles can have big impacts on a company’s survival.

Key factors influencing the balance include:

  1. Learning Needs – A newcomer has to explore heavily at first, then gradually shift resources toward exploitation as they learn what works.
  2. Risk Appetite – High-risk tolerance favors exploration. Conservative environments prioritize exploitation.
  3. Market Dynamics – Volatile or saturated markets necessitate more exploration.
  4. Resource Flexibility – Ample capital allows greater experimentation without jeopardizing core operations.
  5. Competitive Pressure – Rivals’ innovations may force increased exploration.
  6. Feedback Loops – Rapid data analysis (e.g., machine learning) helps dynamically adjust allocations.

The multi-armed bandit problem illustrates this tradeoff:

  • Optimizing immediate gains while gathering information for future rewards

In trading, reinforcement learning might guide real-time adjustments. In business, agile methodologies enable iterative testing. 

Ultimately, success lies in dynamically calibrating ε (the exploration rate) to align with evolving goals, risks, and opportunities – so that neither complacency nor unfocused experimentation dominates.

Thompson Sampling: Bayesian Approach to Uncertainty

Thompson Sampling uses probability distributions to model uncertainty.

Each “arm” (e.g., a stock) is assigned a reward distribution updated after each trade. The algorithm samples from these distributions to choose the next action.

For instance, a trader might model Microsoft (MSFT) and Apple (AAPL) as beta distributions over each stock’s probability of a winning trade, updated from past results.

If MSFT’s distribution suggests a 70% chance of outperforming AAPL, the algorithm allocates more funds to MSFT – but occasionally tests AAPL to refine its estimates.

Thompson Sampling tends to hold up well in volatile markets, because sampling from the posterior naturally keeps testing arms whose estimates are still uncertain.
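A sketch of Thompson Sampling with Beta posteriors, assuming each trade is scored as a simple win/loss. The 55%/45% win rates are invented purely to drive the simulation:

```python
import random

class BetaArm:
    """Beta(alpha, beta) posterior over an asset's win probability."""
    def __init__(self):
        self.alpha, self.beta = 1, 1          # uniform prior

    def sample(self, rng):
        return rng.betavariate(self.alpha, self.beta)

    def update(self, win):
        if win:
            self.alpha += 1
        else:
            self.beta += 1

rng = random.Random(0)
arms = {"MSFT": BetaArm(), "AAPL": BetaArm()}
true_win_rate = {"MSFT": 0.55, "AAPL": 0.45}   # assumed, for illustration only

for _ in range(2000):
    # Draw one sample from each posterior; trade the asset with the highest draw
    choice = max(arms, key=lambda t: arms[t].sample(rng))
    arms[choice].update(rng.random() < true_win_rate[choice])
```

As evidence accumulates, the posterior for the stronger asset narrows around its higher win rate, so it is sampled highest – and therefore traded – more and more often, while the weaker asset still gets occasional tests.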

Upper Confidence Bound (UCB): Balancing Risk and Reward

The UCB algorithm prioritizes options with the highest “upper confidence bound,” calculated as:

UCB = Average Reward + √(2 * ln(Total Plays) / Plays per Arm)

The first term rewards exploitation; the second penalizes under-explored arms.

In trading, UCB might favor a lightly tested biotech stock with moderate observed returns over a heavily traded utility stock, because the biotech’s small sample size earns it a larger exploration bonus.
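The formula above can be computed directly. The play counts and average rewards here are hypothetical, chosen to show how a lightly tested arm can outscore one with a better average:

```python
import math

def ucb_score(avg_reward, total_plays, arm_plays):
    """UCB = average reward + sqrt(2 * ln(total plays) / plays on this arm)."""
    return avg_reward + math.sqrt(2 * math.log(total_plays) / arm_plays)

total = 100
# Invented numbers: the utility stock has the slightly better average, but the
# lightly tested biotech receives a much larger exploration bonus.
biotech = ucb_score(avg_reward=0.05, total_plays=total, arm_plays=10)
utility = ucb_score(avg_reward=0.06, total_plays=total, arm_plays=90)
```

Because the bonus shrinks as an arm accumulates plays, UCB automatically tapers exploration of arms it has already measured well.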

 

Applications of MAB in Trading

Dynamic Asset Allocation

MAB shines in portfolio optimization, where assets compete for capital.

Consider a robo-advisor managing a portfolio of ETFs:

  1. Arms – ETFs (e.g., SPY, GLD, TLT).
  2. Rewards – Daily returns adjusted for risk.
  3. Action – Adjusting weightings weekly.

A Thompson Sampling-based system could shift allocations toward sectors showing momentum (e.g., tech stocks during an AI boom) while keeping exposure to gold as a hedge that’s largely uncorrelated to other financial assets long term.

Strategy Selection and Optimization

Traders often juggle multiple strategies: arbitrage, trend-following, mean reversion, and so on.

MAB treats each strategy as an arm, rewarding those with the highest Sharpe ratios.

For example, a quant fund might use UCB to decide between:

  • A high-frequency arbitrage bot (low returns, high consistency).
  • A leveraged futures strategy (high returns, high drawdowns).

UCB’s balance helps avoid overcommitting to the volatile futures strategy despite its occasional windfalls.

Adaptive Order Execution

Executing large orders without moving markets is an art. MAB models can choose between liquidity pools (e.g., dark pools, exchanges) to minimize slippage.

Imagine an institutional trader selling 100,000 shares. The algorithm:

  1. Exploits dark pools for low-impact sales.
  2. Explores smaller exchanges for better prices.

Over time, it learns which venues offer optimal fills for specific order sizes.
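Venue selection can be sketched as an epsilon-greedy bandit over average observed slippage. The venue names and slippage levels (in basis points) are invented for illustration:

```python
import random

def choose_venue(avg_slip, counts, epsilon=0.1, rng=random):
    """Epsilon-greedy over venues: mostly route to the venue with the
    lowest average observed slippage, but keep sampling the others."""
    if rng.random() < epsilon or min(counts.values()) == 0:
        return rng.choice(sorted(avg_slip))        # explore (or cold start)
    return min(avg_slip, key=avg_slip.get)         # exploit lowest slippage

rng = random.Random(42)
avg_slip = {"dark_pool": 0.0, "exchange_a": 0.0, "exchange_b": 0.0}
counts = dict.fromkeys(avg_slip, 0)
true_slip = {"dark_pool": 2.0, "exchange_a": 3.5, "exchange_b": 5.0}  # bps, invented

for _ in range(500):
    venue = choose_venue(avg_slip, counts, rng=rng)
    observed = true_slip[venue] + rng.gauss(0.0, 0.5)      # noisy fill quality
    counts[venue] += 1
    avg_slip[venue] += (observed - avg_slip[venue]) / counts[venue]  # running mean
```

A production router would also condition on order size and time of day, but the core loop – observe slippage, update the venue’s average, route mostly to the best – is the same.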

Risk Management and Drawdown Mitigation

MAB isn’t just about maximizing returns, but also about survival. 

Incorporating risk metrics (e.g., Value-at-Risk) into reward functions enables algorithms to avoid bets that don’t fit in with sensible risk management.

A hedge fund might use epsilon-greedy with guardrails. 

For example, exploration is limited to low-volatility assets during market downturns. This prevents “gambling” on speculative stocks when capital preservation is critical or when the edge isn’t high enough.
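One way to sketch such a guardrail is to shrink the exploration set to low-volatility arms whenever a stress indicator crosses a threshold. All thresholds and volatility figures below are assumptions:

```python
import random

def guarded_explore(arm_vols, market_vol, stress_threshold=0.3,
                    max_arm_vol=0.15, rng=random):
    """Exploration step with a risk guardrail: when market volatility is
    above the stress threshold, only low-volatility arms may be explored."""
    if market_vol > stress_threshold:
        eligible = [a for a in arm_vols if arm_vols[a] <= max_arm_vol]
    else:
        eligible = list(arm_vols)
    return rng.choice(eligible)

# Invented arm volatilities; with market_vol = 0.4 (a downturn), only the
# defensive arms are ever eligible for exploration.
arm_vols = {"utilities": 0.10, "bonds": 0.08, "biotech": 0.45, "crypto": 0.80}
random.seed(0)
picks = {guarded_explore(arm_vols, market_vol=0.4) for _ in range(200)}
```

The same pattern works with any risk metric in place of volatility – e.g., filtering the eligible set by Value-at-Risk instead.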

 

Challenges and Limitations of MAB in Trading

Non-Stationary Markets and Concept Drift

Financial markets are non-stationary – today’s patterns may vanish tomorrow. 

A MAB model trained on 2021’s meme-stock frenzy might overexploit volatile small-caps in a 2022 bear market.

Solutions:

  • Sliding Windows – Base reward estimates only on recent data (e.g., the past 30 days); longer lookbacks or synthetic data can be used to simulate a wider range of environments.
  • Decay Factors – Weight newer data more heavily.
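A decay factor can be implemented as an exponentially weighted average. The toy reward stream below simulates a regime change to show stale observations fading out:

```python
def decayed_average(old_avg, new_reward, decay=0.9):
    """Exponentially weighted average: each step the old estimate keeps
    `decay` of its weight, so stale observations fade out over time."""
    return decay * old_avg + (1 - decay) * new_reward

avg = 0.0
for reward in [1.0] * 10 + [0.0] * 10:   # regime change: rewards collapse to zero
    avg = decayed_average(avg, reward)
# The estimate climbs toward 1.0, then decays back toward the new regime.
```

With decay = 0.9, an observation 20 steps old carries only about 12% of its original weight, so the model responds to the new regime within a few dozen updates instead of averaging it away.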

High-Dimensional Action Spaces

What if a trader must choose between 10,000 stocks? Traditional MAB scales poorly with arms.

Workarounds:

  • Clustering – Group similar assets (e.g., tech stocks).
  • Contextual Bandits – Use metadata (e.g., P/E ratios) to generalize across arms.
  • Feature Engineering – Reduce complexity by selecting key factors (e.g., momentum, volatility, earnings growth) to filter and prioritize stocks before applying MAB.
  • Deep Reinforcement Learning (DRL) – Use neural networks to approximate value functions, allowing the model to generalize decisions across a large universe of assets.

Transaction Costs and Slippage

Frequent rebalancing burns capital through fees and slippage. A MAB algorithm that churns between arms too often can erode its own profits.

Mitigations:

  • Batch Updates – Rebalance weekly, not hourly.
  • Cost-Aware Rewards – Subtract transaction costs from reward calculations.
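Cost-aware rewards are essentially a one-line adjustment to the reward function. The 10 bps cost per unit of turnover below is an assumption, not a market figure:

```python
def net_reward(gross_return, turnover, cost_per_unit=0.001):
    """Feed the bandit returns net of transaction costs.
    cost_per_unit = 0.001 assumes 10 bps per unit of turnover (illustrative)."""
    return gross_return - cost_per_unit * turnover

# A 0.50% gross gain that required churning 80% of the book...
churned = net_reward(0.005, turnover=0.8)    # nets 0.42%
# ...scores below a 0.45% gain achieved with no rebalancing at all.
held = net_reward(0.0045, turnover=0.0)
```

Once the bandit is scored on net rather than gross returns, high-churn arms stop looking artificially attractive and the algorithm learns to rebalance less.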

 

Future Directions and Hybrid Approaches

Integrating MAB with Reinforcement Learning (RL)

While MAB handles simple “stateless” decisions, RL adds temporal depth – i.e., considering how today’s trade affects tomorrow’s market. 

Hybrid models could use MAB for tactical decisions (e.g., picking stocks, adjusting leverage) and RL for strategic shifts (e.g., how to adjust the asset allocation).

Contextual Bandits for Personalized Trading

Contextual bandits use external data (e.g., Fed rate changes, earnings reports) to inform decisions. 

A model might learn that value stocks outperform before rate hikes and adjust allocations pre-emptively.

Ethical Considerations and Regulatory Compliance

MAB’s adaptability raises questions. 

Could it manipulate markets by exploiting liquidity gaps? 

Traders must audit algorithms for fairness and transparency – especially when managing client funds.

 

Conclusion

Multi-Armed Bandit methods provide traders a mathematically rigorous way to deal with uncertainty and better optimize their processes.

From optimizing portfolios to executing orders, MAB’s blend of exploration and exploitation aligns with how markets “learn” over time. 

Yet challenges like non-stationarity and transaction costs call for careful implementation, rather than blind adherence to academic models that miss the nuances of real-world trading. 

As trading grows more algorithmic, MAB’s role will expand – especially when fused with AI techniques like RL. 

For traders, the lesson is clear: adapt or eventually get left behind.