How to Choose a Reinforcement Learning Algorithm

This page was last edited on 04 March 2026

Although this tutorial is about choosing an algorithm, one caveat has to come first.

The algorithm matters much less than the reward function and the state representation. In the robotics industry, we don’t choose an algorithm because it’s the “newest”, but because it’s the easiest to debug and the most robust to sensor noise.

With that said, let’s return to choosing the algorithm for a given application.

If we choose the wrong algorithm:

  • our agent may never converge to a good policy.
  • training can take much longer.
  • we might waste time, energy, and compute.
  • the agent might learn unstable or unsafe behaviors.

The algorithm still plays an important role: it determines how the agent learns. It controls how the agent explores the environment, how it updates its knowledge, and how it generalizes across states.

RL Algorithm Selection Guide

ANALOGY

Choosing the right RL algorithm for a specific task is like choosing the right tool for a job. You wouldn’t use a hammer to fix a watch. Similarly, we shouldn’t use DQN for continuous actions, or SAC for simple discrete problems. We have to match the tool to the problem.


We group the RL algorithms based on their learning approach.

1. Value-Based

These algorithms learn a value function like Q(s, a). The agent picks actions with the highest value.
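
To make the idea concrete, here is a minimal sketch of a tabular value-based learner. It assumes a tiny problem with two discrete actions; the names ACTIONS, greedy_action, and q_update, as well as the learning rate and discount values, are illustrative choices and not part of any library.

```python
from collections import defaultdict

ALPHA = 0.5   # learning rate (illustrative value)
GAMMA = 0.9   # discount factor (illustrative value)
ACTIONS = ["left", "right"]  # hypothetical discrete action set

# Q-table: maps (state, action) to an estimated return, defaulting to 0.0
Q = defaultdict(float)

def greedy_action(state):
    """Pick the action with the highest Q(s, a) for this state."""
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """One tabular Q-learning step: move Q(s, a) toward the TD target."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    td_target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (td_target - Q[(state, action)])

# One observed transition: in state 0, "right" gave reward 1.0 and led to state 1
q_update(state=0, action="right", reward=1.0, next_state=1)
print(greedy_action(0))  # "right" now has the highest value in state 0
```

DQN follows the same principle but replaces the table with a neural network that approximates Q(s, a).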

2. Policy-Based

These algorithms learn the policy directly (a mapping from states to actions).

  • Examples: PPO
  • Better for continuous actions or stochastic policies
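
As a rough illustration of a stochastic policy, the sketch below turns per-action preferences into probabilities with a softmax and samples from them. The preference values and the function names are assumptions made for this example; in a real policy-based method the preferences would come from a trained network.

```python
import math
import random

def softmax(preferences):
    """Turn raw action preferences into a probability distribution."""
    m = max(preferences)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(preferences, rng=random):
    """Sample an action index from the stochastic policy pi(a | s)."""
    probs = softmax(preferences)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical preferences for three actions in one state
probs = softmax([2.0, 1.0, 0.0])
action = sample_action([2.0, 1.0, 0.0])
print(probs, action)
```

Because every action keeps a non-zero probability, the policy explores without a separate exploration rule.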

3. Actor-Critic

These combine both. The actor chooses actions. The critic evaluates them.

  • Examples: A2C, A3C, DDPG, TD3, SAC
  • Good tradeoff between value and policy-based benefits
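
The critic's evaluation is commonly expressed as a temporal-difference (TD) error. The sketch below assumes a small table of state values and shows only the critic side; the names, step size, and state labels are hypothetical.

```python
GAMMA = 0.99      # discount factor (illustrative value)
CRITIC_LR = 0.1   # critic step size (illustrative value)

def td_error(value, reward, state, next_state, done):
    """How much better or worse the step was than the critic expected."""
    bootstrap = 0.0 if done else GAMMA * value[next_state]
    return reward + bootstrap - value[state]

def critic_update(value, reward, state, next_state, done):
    """Move V(s) toward the TD target and return the TD error for the actor."""
    delta = td_error(value, reward, state, next_state, done)
    value[state] += CRITIC_LR * delta
    return delta

value = {"s0": 0.0, "s1": 1.0}  # hypothetical state values
delta = critic_update(value, reward=0.5, state="s0", next_state="s1", done=False)
print(delta)
```

A positive TD error tells the actor to make the chosen action more likely in that state, and a negative one tells it to make the action less likely.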

4. Model-Free vs Model-Based

  • Model-Free: A model-free algorithm learns only from experience. It typically needs more interactions with the environment, but it is more general.
    Examples: DQN, PPO, SAC.
  • Model-Based: A model-based algorithm learns a model of the environment and can use it to plan, so it can learn from fewer real interactions than a model-free one.
    Examples: Dyna-Q, MuZero.
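
The difference can be sketched in the style of Dyna-Q: each real transition is used once for a model-free update, stored in a learned model, and then replayed several times as simulated experience. This is a simplified sketch that assumes a deterministic environment; all names and constants are illustrative.

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9   # illustrative values
ACTIONS = [0, 1]          # hypothetical discrete actions

Q = defaultdict(float)    # Q[(state, action)]
model = {}                # learned model: (state, action) -> (reward, next_state)

def q_update(s, a, r, s2):
    """Standard Q-learning update, used for real and simulated transitions."""
    best_next = max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def dyna_q_step(s, a, r, s2, planning_steps=5, rng=random):
    """Learn from one real transition, then replay simulated ones from the model."""
    q_update(s, a, r, s2)            # model-free update from real experience
    model[(s, a)] = (r, s2)          # remember what the environment did
    for _ in range(planning_steps):  # planning: extra updates at no environment cost
        ps, pa = rng.choice(list(model))
        pr, ps2 = model[(ps, pa)]
        q_update(ps, pa, pr, ps2)

dyna_q_step(s=0, a=1, r=1.0, s2=1)
print(Q[(0, 1)])
```

With planning_steps=0 this reduces to plain model-free Q-learning; the planning loop is where the model buys extra learning per real interaction.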

Key Decision Factors

Before choosing the right algorithm, we have to evaluate these:

Action Space

  • If our application has a discrete action space, we use DQN, PPO, or A2C
  • Otherwise, if the action space is continuous, we use DDPG, TD3, or SAC

State Space

  • If we have a small state space, it is fine to use a Q-table or DQN
  • For larger state spaces such as vectors or images, we use Deep RL algorithms such as DQN, PPO, SAC

Resources

  • If we’re deploying on low-budget hardware such as Raspberry Pi or Jetson Nano, we can start with DQN or PPO.
  • For high-end hardware (GPU), we can use SAC, TD3
  • If a software simulator such as MuJoCo, Gazebo Sim, or Isaac Sim is not available, it is preferable to choose a sample-efficient algorithm such as SAC or TD3

Stability

  • PPO and SAC are more stable, which is critical when training on physical hardware or in environments where unpredictable agent behavior can lead to hardware damage, unsafe states, or long-term divergence
  • DDPG and A2C can be sensitive to hyperparameters

Exploration vs Exploitation

  • SAC adds entropy to encourage exploration
  • DQN uses ε-greedy (less effective in continuous spaces)
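
Both exploration mechanisms can be sketched in a few lines. The ε-greedy rule below works on a list of Q-values for discrete actions, and the entropy function measures how spread out a policy's action probabilities are. SAC adds an entropy term, scaled by a temperature coefficient, to the objective it maximizes; the code only computes the entropy itself. Function names and example values are illustrative.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def entropy(probs):
    """Shannon entropy of a policy's action distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# With epsilon = 0 the rule is purely greedy and picks the index of the largest value
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))

# A uniform policy has higher entropy than a nearly deterministic one
print(entropy([0.5, 0.5]) > entropy([0.9, 0.1]))
```

ε-greedy needs a finite set of actions to pick a random one and to take the maximum over, which is one reason it does not carry over directly to continuous action spaces.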

Robustness

  • SAC and PPO generalize better in noisy or real environments

Comparison Table & Quick Guide

Algorithm  | Action Space        | Sample Efficiency | Stability | Best Use Case
Q-Learning | Discrete            | High              | High      | Grid worlds, small tasks
DQN        | Discrete            | Medium            | Medium    | Atari games, simple robotics
PPO        | Discrete/Continuous | Medium            | High      | General purpose, reliable
DDPG       | Continuous          | Medium            | Low       | Control tasks, robotics
TD3        | Continuous          | High              | High      | Precision control tasks
SAC        | Continuous          | High              | Very High | Complex, real-world tasks

Real Examples

  • Discretized PWM signal: use DQN or PPO
  • Continuous actions: use DDPG or SAC
  • SAC is the best choice if the system is noisy

  • Discrete grid: use DQN or PPO
  • Complex maze with uncertainty: use PPO with entropy tuning

  • Inputs: camera images; actions: continuous
  • Best choices: PPO, TD3, SAC
  • SAC handles noise and high-dimensional input well

Final Checklist

Key Questions:

  1. Are the actions discrete or continuous?
  2. Is the state space small or large?
  3. Do we have limited hardware resources?
  4. Do we need fast convergence?
  5. Do we need stability over time?
  6. Can our agent learn from scratch, or does it need a model?

Decision Matrix:

  • Discrete + low resources → DQN
  • Discrete + stable, general purpose → PPO
  • Continuous + stability is a concern → TD3 or PPO
  • Continuous + robust to noise → SAC
  • Complex environments + high-dimensional input → SAC or PPO
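
The decision matrix can also be written as a small lookup function. This is only one possible encoding of the bullets above: the function name and its parameters are hypothetical, and two choices are assumptions of this sketch, namely that the high-dimensional rule is checked first and that SAC is suggested only for continuous actions.

```python
def suggest_algorithms(action_space, low_resources=False,
                       noisy=False, high_dim_input=False):
    """Return the algorithms that the decision matrix suggests."""
    if action_space not in ("discrete", "continuous"):
        raise ValueError("action_space must be 'discrete' or 'continuous'")
    if high_dim_input:
        # The matrix lists SAC or PPO; SAC is kept only for continuous actions here
        return ["SAC", "PPO"] if action_space == "continuous" else ["PPO"]
    if action_space == "discrete":
        return ["DQN"] if low_resources else ["PPO"]
    if noisy:
        return ["SAC"]
    # Remaining continuous case: the matrix's TD3 or PPO row
    return ["TD3", "PPO"]

print(suggest_algorithms("discrete", low_resources=True))
print(suggest_algorithms("continuous", noisy=True))
```

A function like this is a starting point for a first experiment, not a replacement for comparing candidates on the actual task.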
