
11.3 Multi-Armed Bandits and Reinforcement Learning


Written by the Fiveable Content Team • Last updated August 2025

Multi-armed bandits and reinforcement learning tackle the exploration-exploitation dilemma in sequential decision-making. These techniques balance gathering new information with maximizing immediate rewards, which is crucial for optimizing outcomes in uncertain environments.

From epsilon-greedy to deep Q-networks, these methods power everything from A/B tests to game-playing AIs. They're key to making smart choices when you don't have all the facts, whether you're picking ads or training robots.

Exploration vs Exploitation Trade-off

Fundamental Concepts

  • Exploration-exploitation trade-off balances gathering new information and maximizing immediate rewards in sequential decision-making
  • Exploration gathers information about environment or possible actions for better future decisions
  • Exploitation maximizes immediate rewards based on current knowledge
  • Trade-off particularly relevant in scenarios with limited resources or time constraints (opportunity cost for each decision)
  • Mathematical formulations involve probability distributions and expected values of rewards for different actions
  • Applicable across various domains (machine learning, artificial intelligence, operations research, adaptive control systems)

Strategies and Considerations

  • Epsilon-greedy methods select best-known action with probability 1-ε and explore randomly with probability ε
  • Upper confidence bound algorithms maintain confidence intervals for expected reward of each arm
  • Thompson sampling uses Bayesian approach with probability distributions over expected rewards
  • Optimal balance varies depending on problem structure, time horizon, and environmental uncertainty
  • Strategies aim to minimize regret (the gap between optimal and actual performance) over time, as formalized below
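
To make the target concrete: for a bandit with arm means $\mu_1, \dots, \mu_K$ and best mean $\mu^* = \max_i \mu_i$, the cumulative (pseudo-)regret after $T$ rounds is standardly defined as

$$R_T = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} \mu_{a_t}\right]$$

where $a_t$ is the arm chosen at round $t$. Algorithms with sublinear regret, such as UCB1 with its $O(\log T)$ guarantee, pick the optimal arm increasingly often as $T$ grows.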

Multi-armed Bandit Algorithms

Epsilon-Greedy Algorithm

  • Simple approach for multi-armed bandit problems
  • Maintains estimates of expected rewards for each arm
  • Updates estimates based on observed outcomes
  • Selects best-known action with probability 1-ε and explores randomly with probability ε
  • Higher ε values promote more exploration
  • Implementation involves tracking reward estimates and action counts, as in the sketch after this list
  • Example: In online advertising with ε = 0.1, ε-greedy shows the best-known ad 90% of the time and tries alternatives the remaining 10%
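
A minimal sketch of this bookkeeping for a Bernoulli bandit, in Python. The arm probabilities, ε value, and round count below are illustrative assumptions, not values from the text:

```python
import random

def epsilon_greedy(true_probs, epsilon=0.1, n_rounds=10_000):
    """Minimal epsilon-greedy for a Bernoulli multi-armed bandit."""
    n_arms = len(true_probs)
    counts = [0] * n_arms        # pulls per arm
    estimates = [0.0] * n_arms   # running mean reward per arm

    for _ in range(n_rounds):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if random.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        # incremental mean: no need to store the full reward history
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates

# e.g., three ads with unknown click-through rates
print(epsilon_greedy([0.05, 0.03, 0.08]))
```

For non-stationary rewards, the incremental mean could be swapped for a fixed step size so that older observations decay.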

Upper Confidence Bound (UCB) Algorithms

  • Use optimism in face of uncertainty to balance exploration and exploitation
  • Maintain confidence intervals for expected reward of each arm
  • Select arm with highest upper bound
  • UCB1 algorithm combines empirical mean reward with exploration bonus
  • UCB1 formula: $\text{UCB1}_j = \bar{X}_j + \sqrt{\frac{2 \ln n}{n_j}}$
    • $\bar{X}_j$: empirical mean reward of arm $j$
    • $n$: total number of pulls across all arms
    • $n_j$: number of times arm $j$ has been pulled
  • Automatically adjusts exploration based on uncertainty
  • Example: In clinical trials, UCB could guide the selection of treatments, balancing known efficacy with the potential of unexplored options (see the sketch below)
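
A minimal UCB1 sketch implementing the formula above. Bernoulli rewards and the initial one-pull-per-arm round are illustrative choices; the latter keeps $n_j > 0$:

```python
import math
import random

def ucb1(true_probs, n_rounds=10_000):
    """Minimal UCB1 for a Bernoulli bandit."""
    n_arms = len(true_probs)
    counts = [0] * n_arms
    means = [0.0] * n_arms

    def pull(arm):
        reward = 1.0 if random.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]

    for arm in range(n_arms):          # pull each arm once so n_j > 0
        pull(arm)
    for t in range(n_arms, n_rounds):  # t = total pulls so far
        scores = [means[a] + math.sqrt(2 * math.log(t) / counts[a])
                  for a in range(n_arms)]
        pull(max(range(n_arms), key=lambda a: scores[a]))
    return means
```

Arms pulled rarely keep a large exploration bonus, so they are revisited until their confidence intervals shrink.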

Thompson Sampling

  • Bayesian approach for multi-armed bandit problems
  • Maintains probability distribution over expected rewards of each arm
  • Samples from these distributions to make decisions
  • Updates posterior distributions based on observed rewards
  • Naturally balances exploration and exploitation
  • Effective in practice, often outperforming simpler methods
  • Example: In A/B testing for website design, Thompson sampling could dynamically allocate traffic to different versions based on performance uncertainty (see the sketch below)
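
A minimal Beta-Bernoulli sketch; the uniform Beta(1, 1) priors and the reward setup are illustrative assumptions:

```python
import random

def thompson_sampling(true_probs, n_rounds=10_000):
    """Thompson sampling with a Beta posterior per Bernoulli arm."""
    n_arms = len(true_probs)
    alpha = [1.0] * n_arms  # prior successes + 1
    beta = [1.0] * n_arms   # prior failures + 1

    for _ in range(n_rounds):
        # sample a plausible mean for each arm, act greedily on the samples
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1.0 if random.random() < true_probs[arm] else 0.0
        # conjugate posterior update
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
    return alpha, beta
```

Arms with wide posteriors occasionally produce high samples and get explored; as evidence accumulates, the sampling concentrates on the best arm.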

Reinforcement Learning Techniques

Q-learning Fundamentals

  • Model-free reinforcement learning algorithm
  • Learns action-value function (Q-function) representing expected cumulative reward
  • Based on Markov Decision Process (MDP) framework
  • Q-learning update rule: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$
    • $\alpha$: learning rate
    • $\gamma$: discount factor for future rewards
  • Iteratively updates Q-values based on observed rewards and maximum Q-value of next state
  • Handles environments with discrete state and action spaces
  • Example: Q-learning applied to game playing (Tic-Tac-Toe) learns optimal moves through repeated play; a toy sketch follows
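
A minimal tabular sketch of the update rule on a toy corridor environment. The environment, hyperparameters, and episode count are illustrative assumptions:

```python
import random
from collections import defaultdict

# Toy corridor MDP: states 0..4, actions 0 = left, 1 = right,
# reward +1 for reaching state 4, which ends the episode.
GOAL = 4

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

Q = defaultdict(float)  # Q[(state, action)], defaults to 0
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy behavior policy
        if random.random() < epsilon:
            a = random.choice((0, 1))
        else:
            a = max((0, 1), key=lambda act: Q[(s, act)])
        s_next, r, done = step(s, a)
        # Q-learning update: bootstrap from the best next action
        best_next = 0.0 if done else max(Q[(s_next, act)] for act in (0, 1))
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next
```

Because the update bootstraps from the max over next actions rather than the action actually taken, Q-learning is off-policy: it learns the greedy policy's values while behaving ε-greedily.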

Policy Gradient Methods

  • Directly optimize policy (mapping from states to actions)
  • Use gradient ascent on expected cumulative reward
  • Useful for continuous action spaces and high-dimensional state spaces
  • REINFORCE algorithm uses Monte Carlo sampling to estimate policy gradients
  • Policy gradient theorem forms the basis for many algorithms: $\nabla_\theta J(\theta) = E_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a|s) \, Q^{\pi_\theta}(s, a) \right]$
  • Can incorporate function approximation (neural networks) for complex state spaces
  • Example: Policy gradients applied to robot control tasks learn smooth, continuous actions for navigation or manipulation (a minimal REINFORCE sketch follows)
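
A minimal REINFORCE sketch on a one-step environment (a bandit treated as a single-step MDP), using the softmax log-gradient $\nabla_\theta \log \pi_\theta(a) = \mathbf{1}_a - \pi_\theta$; the reward probabilities are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.2, 0.5, 0.8])  # illustrative one-step reward rates
theta = np.zeros(3)                     # action preferences
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(5_000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)
    G = float(rng.random() < true_probs[a])  # episode return (single step)
    # REINFORCE: theta += lr * G * grad log pi(a)
    grad_log = -pi
    grad_log[a] += 1.0
    theta += lr * G * grad_log

print(softmax(theta))  # probability mass should concentrate on the best action
```

Real tasks use multi-step returns and typically subtract a baseline from $G$ to reduce the variance of the gradient estimate.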

Deep Reinforcement Learning

  • Combines RL algorithms with deep neural networks
  • Handles complex, high-dimensional state spaces (images, sensor data)
  • Deep Q-Network (DQN) uses convolutional neural networks for Q-function approximation
  • Actor-Critic methods separate policy (actor) and value function (critic) learning
  • Proximal Policy Optimization (PPO) improves stability of policy gradient methods
  • Addresses challenges of sparse rewards and long-term credit assignment
  • Example: DeepMind's AlphaGo used deep RL to master the game of Go, defeating world champions (a minimal DQN-style sketch of the value update follows)
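
As a minimal illustration of the DQN value update (assuming PyTorch; the replay buffer, target-network sync schedule, and environment loop are omitted):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small MLP Q-function; DQN on images would use conv layers instead."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """TD loss on a replay batch (s: float, a: long, r/done: float tensors)."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # targets come from a frozen copy of the network
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.smooth_l1_loss(q_sa, target)  # Huber loss
```

In full DQN, target_net's weights are copied from q_net every fixed number of steps, and batches are sampled uniformly from a buffer of past transitions; both tricks stabilize training.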

Algorithm Performance Evaluation

Evaluation Metrics

  • Cumulative regret measures total loss compared to the optimal strategy over time (computed in the sketch after this list)
  • Simple regret focuses on quality of final recommendation or decision
  • Best arm identification rate assesses ability to find optimal action
  • Average return and discounted cumulative reward evaluate overall performance in RL
  • Learning speed (sample efficiency) measures how quickly algorithms improve
  • Online performance evaluates adaptation during learning process
  • Offline performance assesses generalization after learning completes
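
A small sketch of how cumulative (pseudo-)regret is computed from a run log; the trace values below are illustrative:

```python
import numpy as np

def cumulative_regret(chosen_means, best_mean):
    """Summed gap between the best arm's expected reward and the
    expected reward of each arm actually chosen."""
    return np.cumsum(best_mean - np.asarray(chosen_means))

# e.g., an agent that locks onto the 0.8 arm after two exploratory pulls
print(cumulative_regret([0.2, 0.5, 0.8, 0.8, 0.8], best_mean=0.8))
# -> [0.6, 0.9, 0.9, 0.9, 0.9]
```

A flattening regret curve is the signature of effective learning; a linearly growing one indicates persistently suboptimal choices.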

Real-world Applications and Challenges

  • A/B testing in online advertising and recommendation systems uses multi-armed bandits
  • Reinforcement learning applied in robotics, game playing, and resource management
  • Non-stationarity introduces time-varying rewards or state transitions
  • Partial observability limits access to complete state information
  • High-dimensional state spaces require efficient function approximation
  • Safety considerations crucial in physical systems (robotics, autonomous vehicles)
  • Scalability to large state/action spaces needed for practical applications
  • Example: Recommender systems use bandits to balance exploring new content and exploiting known user preferences

Robustness and Deployment Considerations

  • Algorithms must adapt to environmental changes in real-world scenarios
  • Evaluate performance across different initial conditions and random seeds
  • Consider computational requirements for real-time decision-making
  • Assess data efficiency to minimize costly interactions with environment
  • Balance exploration and exploitation in production systems
  • Implement safeguards against unexpected or adversarial inputs
  • Continuously monitor and update models in deployed systems
  • Example: Self-driving car algorithms must robustly handle diverse traffic scenarios and weather conditions