This page was last edited on 04 March 2026
Even though this tutorial should only talk about choosing an algorithm, a truth must be told before getting into it.
The algorithm matters much less than the Reward Function and State Representation. In the robotics industry, we don’t choose an algorithm because it’s the “newest”, but because it’s the easiest to debug and most robust to sensor noise.
Let’s go back to choosing the algorithm depending on the application.
If we choose the algorithm wrong:
- our agent may never converge to a good policy.
- training can take much longer.
- we might waste time, energy, and compute.
- the agent might learn unstable or unsafe behaviors.
The algorithm plays an important role in learning: it determines how the agent learns. It controls how the agent explores the environment, how it updates its knowledge, and how it generalizes across states.

ANALOGY
Choosing the right RL algorithm for a specific task is like choosing the right tool for a job. You wouldn’t use a hammer to fix a watch. Similarly, we shouldn’t use DQN for continuous actions, or SAC for simple discrete problems. We have to match the tool with the problem.
Types of RL Algorithms
We group the RL algorithms based on their learning approach.
1. Value-Based
These algorithms learn a value function like Q(s, a). The agent picks actions with the highest value.
- Examples: Q-Learning, DQN
- Good for discrete actions
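As a minimal sketch of the value-based idea, the snippet below keeps Q(s, a) in a table, picks the action with the highest value, and applies one standard tabular Q-learning update. The state names, action names, and the values of `alpha` and `gamma` are illustrative assumptions, not values taken from this tutorial.

```python
from collections import defaultdict


def greedy_action(q, state, actions):
    """Return the action with the highest Q(s, a) for this state."""
    return max(actions, key=lambda a: q[(state, a)])


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward the TD target."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical two-action problem; unseen (state, action) pairs start at 0.0.
q = defaultdict(float)
actions = ["left", "right"]
q_learning_update(q, "s0", "right", reward=1.0, next_state="s1", actions=actions)
best = greedy_action(q, "s0", actions)
```

After this single update, "right" has a higher value than "left" in state "s0", so the greedy choice becomes "right". DQN follows the same update idea but replaces the table with a neural network.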
2. Policy-Based
These algorithms learn the policy directly (a mapping from states to actions).
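- Examples: REINFORCE, PPO (in practice PPO is usually implemented with a critic, so it also fits the actor-critic group below)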
- Better for continuous actions or stochastic policies
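To make "learning the policy directly" concrete, here is a small sketch of a softmax policy over action preferences with a REINFORCE-style update. It is deliberately stateless (a single decision point), and the learning rate and return value are assumptions chosen only for illustration.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(prefs, rng):
    """Sample an action index from the stochastic policy."""
    probs = softmax(prefs)
    return rng.choices(range(len(prefs)), weights=probs, k=1)[0]


def reinforce_update(prefs, action, ret, lr=0.1):
    """Policy-gradient step: for a softmax policy the gradient of
    log pi(action) with respect to the preferences is one_hot(action) - probs."""
    probs = softmax(prefs)
    return [
        p + lr * ret * ((1.0 if i == action else 0.0) - probs[i])
        for i, p in enumerate(prefs)
    ]


rng = random.Random(0)
prefs = [0.0, 0.0]                 # two actions, initially equally likely
action = sample_action(prefs, rng)
new_prefs = reinforce_update(prefs, action=1, ret=1.0)
```

A positive return for action 1 raises its probability above 0.5; no value function is involved at any point, which is the defining trait of a purely policy-based method.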
3. Actor-Critic
These combine both. The actor chooses actions. The critic evaluates them.
- Examples: A2C, A3C, DDPG, TD3, SAC
- Good tradeoff between value and policy-based benefits
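The actor/critic split can be sketched with a tabular critic and a softmax actor. This is a simplified one-step actor-critic update under assumed learning rates and a hypothetical two-state problem; it is not how A2C, DDPG, TD3, or SAC are implemented in full, only the shared core idea.

```python
import math


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(prefs, values, state, action, reward, next_state, done,
                      actor_lr=0.1, critic_lr=0.1, gamma=0.99):
    """One-step actor-critic update; returns the TD error."""
    # Critic: evaluate the action through the TD error of the state value.
    target = reward + (0.0 if done else gamma * values[next_state])
    td_error = target - values[state]
    values[state] += critic_lr * td_error

    # Actor: shift preferences toward the taken action, scaled by the TD error.
    probs = softmax(prefs[state])
    for a in range(len(probs)):
        grad = (1.0 if a == action else 0.0) - probs[a]
        prefs[state][a] += actor_lr * td_error * grad
    return td_error


prefs = {"s0": [0.0, 0.0]}            # actor: action preferences per state
values = {"s0": 0.0, "s1": 0.0}       # critic: state values
td = actor_critic_step(prefs, values, "s0", action=1, reward=1.0,
                       next_state="s1", done=False)
```

The critic's TD error replaces the raw return as the learning signal for the actor, which typically lowers the variance of the policy update.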
4. Model-Free vs Model-Based
- Model-Free: A model-free algorithm learns only from experience. It usually needs more interactions with the environment, but it is more general.
Examples: DQN, PPO, SAC
- Model-Based: A model-based algorithm learns a model of the environment. Such an algorithm can learn faster than a model-free one.
Examples: Dyna-Q, MuZero
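The difference between the two families can be illustrated with a Dyna-Q style step: one model-free update from the real transition, followed by a few planning updates replayed from a learned model. This sketch assumes a deterministic environment and a simple dictionary model; the function name, state names, and hyperparameters are hypothetical.

```python
import random
from collections import defaultdict


def dyna_q_step(q, model, s, a, r, s2, actions, rng,
                n_planning=5, alpha=0.1, gamma=0.95):
    """One Dyna-Q step: direct update, model learning, then planning."""

    def update(s, a, r, s2):
        best = max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (r + gamma * best - q[(s, a)])

    update(s, a, r, s2)            # model-free update from real experience
    model[(s, a)] = (r, s2)        # remember what the environment did
    for _ in range(n_planning):    # extra updates from simulated experience
        (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
        update(ps, pa, pr, ps2)


# Same real transition, with and without planning.
q_free, q_dyna = defaultdict(float), defaultdict(float)
dyna_q_step(q_free, {}, "s0", "go", 1.0, "s1", ["go"], random.Random(0), n_planning=0)
dyna_q_step(q_dyna, {}, "s0", "go", 1.0, "s1", ["go"], random.Random(0), n_planning=5)
```

With the same single real transition, the planning variant ends up with a larger value estimate for ("s0", "go"), which is the sense in which a model-based method can get more out of each interaction.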
Key Decision Factors
Before choosing the right algorithm, we have to evaluate these:
Action Space
- If our application has a discrete action space, we use DQN, PPO, or A2C
- Otherwise, if the action space is continuous, we use DDPG, TD3, SAC
State Space
- If we have a small state space, it is fine to use a Q-table or DQN
- For larger state spaces such as vectors or images, we use Deep RL algorithms such as DQN, PPO, SAC
Resources
- If we’re deploying on low-budget hardware such as Raspberry Pi or Jetson Nano, we can start with DQN or PPO.
- For high-end hardware (GPU), we can use SAC, TD3
- If a software simulator such as MuJoCo, Gazebo Sim, or Isaac Sim is not available, it is preferable to choose a sample-efficient algorithm such as SAC or TD3, because every training sample has to be collected on the real system
Stability
- PPO and SAC are more stable, which is critical when training on physical hardware or in environments where unpredictable agent behavior can lead to hardware damage, unsafe states, or long-term divergence
- DDPG and A2C can be sensitive to hyperparameters
Exploration vs Exploitation
Robustness
- SAC and PPO generalize better in noisy or real environments
Comparison Table & Quick Guide
| Algorithm | Action Space | Sample Efficiency | Stability | Best Use Case |
|---|---|---|---|---|
| Q-Learning | Discrete | High | High | Grid worlds, small tasks |
| DQN | Discrete | Medium | Medium | Atari games, simple robotics |
| PPO | Discrete/Continuous | Medium | High | General purpose, reliable |
| DDPG | Continuous | Medium | Low | Control tasks, robotics |
| TD3 | Continuous | High | High | Precision control tasks |
| SAC | Continuous | High | Very High | Complex, real-world tasks |
Real Examples
Simple Case: DC Motor PWM Control
- Discretized PWM levels: use DQN or PPO
- Continuous PWM duty cycle: use DDPG or SAC
- SAC is best if the system is noisy
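The two ways of framing the PWM action can be sketched as follows. The helpers below are hypothetical and assume the duty cycle is expressed as a fraction between 0.0 and 1.0, and that a continuous-action agent outputs values in [-1, 1]; real motor drivers may use different ranges.

```python
def pwm_levels(n_actions, min_duty=0.0, max_duty=1.0):
    """Evenly spaced duty cycles for a discrete agent (e.g. DQN or PPO).
    Action index i maps to the i-th entry of the returned list."""
    if n_actions < 2:
        raise ValueError("need at least two discrete PWM levels")
    step = (max_duty - min_duty) / (n_actions - 1)
    return [min_duty + i * step for i in range(n_actions)]


def continuous_to_duty(action, min_duty=0.0, max_duty=1.0):
    """Map a continuous action in [-1, 1] (e.g. from DDPG or SAC)
    to a duty cycle, clipping out-of-range values."""
    action = max(-1.0, min(1.0, action))
    return min_duty + (action + 1.0) / 2.0 * (max_duty - min_duty)


levels = pwm_levels(5)              # discrete case: 5 possible duty cycles
duty = continuous_to_duty(0.0)      # continuous case: mid-range action
```

A coarse discretization keeps the action space small but limits control resolution, while the continuous mapping gives fine control at the cost of needing a continuous-action algorithm.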
Intermediate Case: Maze Navigation
- Discrete grid: DQN or PPO
- Complex maze with uncertainty: PPO with entropy tuning
Complex Case: Visual Navigation for a Robot
- Inputs: camera images; actions: continuous
- Best choices: PPO, TD3, SAC
- SAC handles noise and high-dimensional input well
Final Checklist
Key Questions:
- Are the actions discrete or continuous?
- Is the state space small or large?
- Do we have limited hardware resources?
- Do we need fast convergence?
- Do we need stability over time?
- Can our agent learn from scratch, or does it need a model?
Decision Matrix:
- Discrete + low resources → DQN
- Discrete + stable, general purpose → PPO
- Continuous + stability is a concern → TD3 or PPO
- Continuous + robust to noise → SAC
- Complex environments + high-dimensional input → SAC or PPO
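The decision matrix can be written down as a small lookup function. This is only one possible reading of the rules above: the function name, the flag names, and the order in which the rules are checked are assumptions, and the result is a starting point rather than a final answer.

```python
def suggest_algorithms(action_space, low_resources=False, noisy=False,
                       high_dim_input=False):
    """Return candidate algorithms following the decision matrix above."""
    if action_space not in ("discrete", "continuous"):
        raise ValueError("action_space must be 'discrete' or 'continuous'")

    if high_dim_input:
        # SAC is listed as continuous-only in the comparison table,
        # so only PPO is kept for discrete actions.
        return ["PPO"] if action_space == "discrete" else ["SAC", "PPO"]

    if action_space == "discrete":
        return ["DQN"] if low_resources else ["PPO"]

    # Continuous actions: SAC when noise robustness matters, else TD3 or PPO.
    return ["SAC"] if noisy else ["TD3", "PPO"]


motor_choice = suggest_algorithms("continuous", noisy=True)
maze_choice = suggest_algorithms("discrete", low_resources=True)
```

For example, a noisy continuous-control task such as the DC motor case maps to SAC, and a discrete task on limited hardware maps to DQN.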
References
- Lillicrap et al. (2016). Continuous control with deep reinforcement learning (DDPG)
- Schulman et al. (2017). Proximal Policy Optimization Algorithms (PPO)
- Fujimoto et al. (2018). Addressing function approximation error in actor-critic methods (TD3)
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.