What you will learn from this tutorial:
- Why Actor–Critic exists, and why Q-learning/DQN and pure policy gradients are not enough for real problems.
- What are the real limitations of value-based methods and policy-gradient methods: variance, stability, late feedback, weak exploration, difficulties in continuous actions.
- How Actor–Critic solves these problems, by clearly separating the roles: actor = decision, critic = evaluation, and by introducing stable feedback through TD-learning.
- How the Actor–Critic cycle works in practice, step-by-step: observation –> action –> reward –> evaluation –> policy and values update.
- Why stability in RL is not random, how the Critic reduces the gradient variance, and what is the trade-off between stability (low variance) and bias.
- What does a Critic “too weak” or “too strong” mean in practice, how this looks in TensorBoard and why the Actor sometimes seems “crazy” when, in fact, the Critic is the problem.
- How to choose correctly between V(s), Q(s,a) and Advantage, what each variant changes in the learning dynamics and why Advantage Actor–Critic is the modern “sweet spot”.
- How the theory connects to real algorithms: how “Actor–Critic from the book” becomes A2C, A3C, PPO, DDPG, TD3 and SAC.
- The clear difference between on-policy and off-policy, what it means in terms of sample efficiency and stability, and when to use each approach.
- Why PPO is the “workhorse” of modern RL, and in which situations SAC outperforms it, especially in robotics and continuous control.
- In which real-world scenarios does Actor–Critic really matter, from robotics and locomotion to finance, energy and industrial systems where data stability and efficiency are critical.
- How to use Gymnasium intelligently, not as a game: what problems do CartPole, Acrobot and Pendulum solve and what insights do you transfer directly to real robots.
- What does a functional Actor–Critic look like in reality, without long code: the logical structure for discrete and continuous action spaces.
- What are the hyperparameters that really matter (actor vs critic LR, discount, PPO clipping, SAC temperature) and how do they influence stability and performance.
- What graphs should you watch as a professional, not as a beginner: value loss, policy loss, entropy, reward, TD-error and what they tell you about the health of the agent.
- The real pitfalls that many don’t tell you, such as unstable Critic, bad reward scaling, lack of normalization, wrong entropy or blocked policy.
- Why Actor–Critic isn’t just theory, but has become the foundation of modern RL — and why, if you understand Actor–Critic, you understand virtually all of RL that matters in the real world.
TABLE OF CONTENTS
- 1. Why Actor–Critic Exists: When Q-Learning and Pure Policy Gradients Aren’t Enough
- 2. The Truth About Stability in RL: Why Your Agent Learns… or Collapses
- 3. From A2C to PPO, TD3 and SAC: Meet the Actor–Critic Family and When to Use Each
- 4. If You Want Real Robots, Start Here: Actor–Critic in Continuous Action Spaces
- 5. Conclusion
1. Why Actor–Critic Exists: When Q-Learning and Pure Policy Gradients Aren’t Enough

Policy-Based –> Powerful but Unstable
Actor + Critic = Stability + Power
What are the concrete limitations of value-based methods (Q-learning, DQN) and policy-gradient in real problems?
1.1 The problem with value-based methods (Q-learning, DQN)
They try to learn a function – Q(s,a) – and derive the decision from it.
In practice:
- Continuous actions = disaster or complicated hacks.
- In continuous control you can’t take an argmax over infinitely many actions (argmax, the “argument of the maximum”, is the input that yields the highest output of a function).
- That’s why DQN works well in Atari, but not in locomotion, robotics, or fine control.
Info: DQN-style ideas can be carried over to continuous actions through hybrid actor-critic methods (e.g. DDPG), which makes the transition less “disastrous.”
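To make the argmax point concrete, here is a minimal Python sketch. The Q-values and the function `q_continuous` are made up for illustration; only the contrast between the two cases matters.

```python
# Illustrative only: the Q-values and q_continuous below are made-up examples.

def greedy_action(q_values):
    """Argmax over a finite action set: return the index of the largest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Discrete case: three actions, one pass over the list is enough.
q_discrete = [0.1, 0.7, 0.3]
best_discrete = greedy_action(q_discrete)

def q_continuous(action):
    """Hypothetical Q(s, a) for one fixed state, peaked at a = 0.37."""
    return -(action - 0.37) ** 2

def grid_argmax(q_fn, low, high, n_points):
    """Approximate the argmax over a continuous interval by trying a grid of actions."""
    step = (high - low) / (n_points - 1)
    candidates = [low + i * step for i in range(n_points)]
    return max(candidates, key=q_fn)

coarse = grid_argmax(q_continuous, -1.0, 1.0, 5)    # grid spacing 0.5
fine = grid_argmax(q_continuous, -1.0, 1.0, 201)    # grid spacing 0.01
```

The coarse grid misses the peak and the fine grid gets close, but the number of grid points grows exponentially with the number of action dimensions. That is the “complicated hack” a value-based method needs and an explicit actor avoids.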
Implicit and cumbersome policy
- The policy is not learned explicitly.
- It only emerges indirectly, by maximizing Q –> this makes exploration difficult and unstable.
Stability
In combination with neural networks, Q-learning is prone to divergence.
You need:
- target networks,
- experience replay,
- many “engineering tricks” just to keep it from exploding.
In conclusion, Q-learning / DQN are good for discrete actions: simple environments or pixel games with a finite action set. But in real problems with continuous actions and complex dynamics, they become a brake.
1.2 The problem of pure policy-based methods
Here the policy π(a∣s) is learned directly. In theory elegant. In practice:
Very high variance
- The policy gradient is based on entire episodes.
- Learning is unstable and slow.
- You have “policy oscillations”: learn, forget, jump around.
Late feedback
- The algorithm uses the rewards only after the episode has ended and “pushes” one global gradient.
- It does not learn step-by-step, but retroactively.
There is no “opinion” of an evaluator
- Policy has no internal mechanism to say: “Was this action better than I expected or worse?”
- No evaluator –> chaotic learning.
In other words, policy gradient is nice in theory, good in introductory articles, but hard to use as a basis for serious systems.
1.3 How does actor-critic combine the advantages of both worlds in a unified framework?
Actor–Critic says: “Let’s have two brains in the agent: one that decides and one that judges.”
Actor
- represents policy,
- decides action directly,
- learns to choose smooth, continuous, intelligent moves.
Critic
- learns a value function (V, Q or Advantage A),
- estimates how good an action was in context,
- produces a fast correction signal (TD-error).
What happens practically?
1. The actor takes action.
2. Environment gives reward + new state.
3. Critic evaluates: “better than expected?” or “worse?“
4. The actor immediately adjusts in the right direction.
Result:
- healthier exploration,
- more stable learning,
- lower variance than REINFORCE (pure policy gradient),
- native support for continuous actions.
Actor + Critic = decision + instant feedback, in a coherent cycle.
1.4 In which real-world scenarios is actor-critic suitable?
Robotics and continuous control
- bipedal gait,
- robotic arm control,
- stabilization,
- manipulation,
- autonomous vehicles.
Here the actions are continuous, not discrete buttons:
- angles,
- torques,
- accelerations.
Actor-Critic algorithms such as PPO and SAC are the standard here.
Finance / trading / market making
- decisions are continuous or nearly continuous,
- cost of failure is high,
- sample efficiency matters.
Energy / power control / telecom / industry
- parameter tuning,
- continuous optimization,
- large, noisy systems,
- need for stability.
Actor-critic algorithms are preferred here because they can learn online, operate stably, and handle uncertainty better.
1.5 How much time and frustration does actor-critic save vs. other solutions?
Fewer wasted iterations
Lower variance means fewer wasted runs and less “trial-and-pray.”
Fewer engineering tricks to keep it from exploding
Compare:
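- DQN: target network + replay + clipping + all sorts of fireworks bolted on,
- PPO / SAC: the stabilizing mechanisms (clipped updates, entropy regularization, twin critics) are part of the algorithm design itself.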
Go from demo to real robots much easier
With Actor-Critic, the transition from simulation to a real robot tends to be more predictable, stable, and reproducible.
Less psychological frustration
Instead of seeing how the agent works, then breaks down completely, you see how the agent gets better and better.
This matters a lot for someone learning or building real systems.
1.6 Why has actor-critic become the de facto standard?
Because it has a solid theoretical foundation. It is mathematically elegant and justified, and it is based on:
- Policy Gradient Theorem,
- Temporal Difference Learning,
- Advantage Estimation.
It is stable in practice, not just in theory. It works in logistics, robotics, and industry.
It scales to large problems. It works with deep networks, runs in complex environments, and supports parallelized training.
In modern RL, there are Actor-Critic algorithms in refined variants:
- A2C
- A3C
- PPO
- DDPG
- TD3
- SAC
If you understand Actor-Critic, you understand modern RL.
2. The Truth About Stability in RL: Why Your Agent Learns… or Collapses

2.1 How does the Critic reduce the variance of the policy gradient and what bias trade-off does it introduce?
In REINFORCE (pure policy gradient), the policy is updated using the full episode return. This means:
- sparse feedback,
- high noise,
- unstable gradient,
- heavy dependence on luck.
Mathematically, the gradient depends on entire episodes, which results in high variance. In practice, two very similar episodes can produce very different updates, just from stochastic fluctuations.
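A small sketch of why full-episode returns are noisy. The two reward sequences are hypothetical; they differ only in the final step, yet the return credited to the very first action changes sign.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Two hypothetical episodes that are identical except for the last reward.
episode_a = [0.0, 0.0, 0.0, 1.0]
episode_b = [0.0, 0.0, 0.0, -1.0]

returns_a = discounted_returns(episode_a, gamma=0.9)
returns_b = discounted_returns(episode_b, gamma=0.9)
```

In REINFORCE each action's log-probability gradient is weighted by its return, so the first action is reinforced in episode A and punished in episode B, even though the agent may have done exactly the same thing at that step.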
What does the Critic change?
The Critic introduces a learned value function (V, Q or Advantage) and a local learning signal:
\[ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \]
Instead of waiting for the end of the episode, the policy receives feedback “in real time.”
The result is that:
- there is less reliance on luck,
- we have more stable and frequent feedback,
- we have more precise corrections,
- we have much lower gradient variance.
In practical terms, instead of “chaotic learning“, you have coherent progress.
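The TD-error is a one-line computation. The sketch below is a minimal version; the `done` flag (no bootstrapping from a terminal state) is a common convention added here as an assumption, and the numbers in the usage lines are arbitrary.

```python
def td_error(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step TD-error: delta = r + gamma * V(s') - V(s).

    If the episode ended at this step there is no next state to bootstrap from,
    so the V(s') term is dropped (assumed convention).
    """
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s

# Arbitrary example values: the critic expected 2.5, reality looked slightly better.
delta_better = td_error(reward=1.0, value_s=2.5, value_next=2.0, gamma=0.9)
# Same transition, but the episode ended: the outcome is worse than expected.
delta_worse = td_error(reward=1.0, value_s=2.5, value_next=2.0, gamma=0.9, done=True)
```

A positive delta says “better than expected” and pushes the probability of the chosen action up; a negative delta pushes it down.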
But this stability comes at a cost: bias
The critic learns an approximation. Any approximation can be wrong. This introduces bias, i.e. systematic errors in the estimation of the value.
In other words, if the Critic is consistently wrong in the same direction, it will push the policy in the wrong direction.
The principle we need to remember is that the policy gradient is:
- theoretically unbiased (no bias),
- enormously noisy (high variance).
and Actor–Critic:
- much more stable (low variance),
- can have bias from the critic.
Actor–Critic is a smart engineering trade-off: it accepts some bias in exchange for getting rid of the chaos. This strategy wins in most real problems.
2.2 What does a “too weak” or “too strong” Critic mean in practice?
The critic is the engine of stability. But it can be too weak or too aggressive. Both cases destroy performance.
When the Critic is too weak, it means that:
- the value function learns hard,
- its estimates are unstable,
- the TD-error is chaotic.
Symptoms in training graphs (e.g. TensorBoard) are:
- large and oscillating value loss for a long time,
- very noisy advantage distribution,
- “sawtooth” learning curves (rise–fall–rise–fall),
- the average reward does not converge, it just pulsates,
- the policy seems to learn something –> then forget –> then learn again –> fall again.
What does it mean conceptually?
The copilot (Critic) stutters contradictory instructions. The driver (Actor) learns poorly and uncertainly.
When the Critic is too strong, it means that:
- the Critic trains much faster than the Actor,
- the values seem “very confident,”
- but do not reflect reality.
Symptoms in graphs (TensorBoard example):
- artificially small value loss (but unrelated to policy improvement),
- the reward does not increase or even decreases,
- very small actor loss –> but poor performance,
- entropy decreases too quickly –> the policy becomes prematurely rigid.
Conceptually, it means that the co-pilot becomes “arrogant.” He says with too much certainty what is right and what is wrong, even when he is wrong. The Actor believes him and goes into the abyss.
What we need to remember here is that stability does not come from just any Critic. It comes from a Critic:
- not too slow,
- not too aggressive,
- aligned with the rhythm of the Actor.
That’s why practitioners sometimes:
- use separate learning rates,
- normalize the advantage,
- use gradient clipping.
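A rough sketch of these three tricks in plain Python. The learning-rate values are hypothetical placeholders, not recommendations, and gradients are represented as flat lists of floats for simplicity.

```python
import math
import statistics

# Hypothetical values, only to show that the two networks can use different rates.
ACTOR_LR = 3e-4
CRITIC_LR = 1e-3

def normalize_advantages(advantages, eps=1e-8):
    """Rescale a batch of advantages to zero mean and (roughly) unit variance."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]

def clip_by_global_norm(grads, max_norm):
    """Shrink the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

normalized = normalize_advantages([1.0, 2.0, 3.0])
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5 is scaled down to 1
```

Normalization keeps the actor's update size independent of the reward scale, and clipping keeps one bad batch from throwing either network far off course.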
2.3 How do we choose between V, Q and A for the Critic?
The Critic can learn:
- V(s) –> the value of the state,
- Q(s, a) –> the value of the action in the state,
- A(s, a) –> how much better the action is than the average action in that state.
And although they are all “value functions”, they change the learning dynamics.
1. Standard Actor–Critic: Critic = V(s)
The Critic learns the value of the state.
Advantages:
- stable
- simple
- easy implementation
- clear feedback
Disadvantages:
- does not distinguish directly between actions,
- comparisons are “coarser.”
It is used massively in:
- A2C,
- A3C,
- many PPO setups.
It is the safe and healthy choice for most cases.
2. Critic = Q(s, a)
The Critic learns the value of each action in the state.
Advantages:
- more precise feedback per action,
- useful in deterministic continuous control (DDPG, TD3),
- allows more direct updates.
Disadvantages:
- harder to stabilize,
- more sensitive to errors,
- can explode if not regularized.
Used by:
- DDPG,
- TD3,
- some SAC variants.
3. Advantage Actor–Critic: Critic = A(s, a)
Instead of just learning the absolute value, learn: “How much better is the action than what was already reasonable to do?”
\[ A(s, a) = Q(s, a) - V(s) \]
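For a discrete action set the definition can be checked by hand. The sketch below uses the standard identity that V(s) is the policy-weighted average of Q(s, a); the probabilities and Q-values are made-up numbers.

```python
def state_value(policy_probs, q_values):
    """V(s) as the expectation of Q(s, a) under the policy's action probabilities."""
    return sum(p * q for p, q in zip(policy_probs, q_values))

def action_advantages(policy_probs, q_values):
    """A(s, a) = Q(s, a) - V(s) for every action in one state."""
    v = state_value(policy_probs, q_values)
    return [q - v for q in q_values]

# Made-up numbers for a single state with three actions.
probs = [0.5, 0.25, 0.25]
q_vals = [1.0, 2.0, 4.0]

adv = action_advantages(probs, q_vals)
expected_adv = sum(p * a for p, a in zip(probs, adv))
```

Here V(s) is 2.0, so the advantages are -1.0, 0.0 and 2.0, and their policy-weighted average is zero. The advantage only says which actions are better or worse than what the policy already does, which is why it makes a well-centered learning signal.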
Advantages:
- highly informative learning signal,
- low variance,
- fine balance between precision and stability.
It is the foundation for:
- A2C,
- A3C,
- PPO (via GAE — Generalized Advantage Estimation).
It is probably the modern “sweet spot.”
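Since GAE is named above, here is a minimal sketch of how it is usually computed for one rollout. It assumes the rollout contains no episode boundary in the middle; `last_value` is the critic's estimate for the state after the final reward (0.0 if that state is terminal), and `lam` is the GAE mixing parameter.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single rollout.

    rewards[t] and values[t] = V(s_t) must have the same length.
    lam = 0 gives one-step TD-errors; lam = 1 gives discounted return minus V(s_t).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD-error
        running = delta + gamma * lam * running               # exponentially weighted sum
        advantages[t] = running
        next_value = values[t]
    return advantages

# Toy rollout: reward 1 at every step, an untrained critic that outputs 0 everywhere.
monte_carlo_like = gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=1.0)
td_like = gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=0.0)
```

The `lam` parameter is the bias–variance dial from section 2.1 in a single number: closer to 1 trusts the observed rewards more (less bias, more variance), closer to 0 trusts the Critic more.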
The clear principle is that if you want:
- simple stability –> V(s),
- extreme control and precision –> Q(s,a),
- optimal modern balance –> Advantage.
3. From A2C to PPO, TD3 and SAC: Meet the Actor–Critic Family and When to Use Each
3.1 How does Actor-Critic map from theory to concrete algorithms A2C, A3C, PPO, DDPG, TD3, SAC?
Actor–Critic in theory means:
- Actor = policy πθ(a∣s) that decides the action,
- Critic = a value function (V, Q or Advantage) that estimates how good the action was and gives the learning signal.
All modern successful algorithms are nothing more than engineered variants of this simple idea.
A2C / A3C — the “classic” Actor–Critic, educational, but practical
- Actor = stochastic policy,
- Critic = V(s) or Advantage,
- Update based on advantage estimation,
- A3C uses multiple workers in parallel (asynchronous),
- A2C is the synchronous, more stable version.
Why does it exist?
These algorithms are more stable than REINFORCE, and the agent learns faster than with simple value-based algorithms. They are also good for understanding the Actor–Critic concept in its pure form, which makes them an excellent mental foundation.
PPO — Seatbelted Actor–Critic
Still Actor–Critic, but with a twist:
- Actor = policy
- Critic = value / advantage
- mechanisms are added to control how much the policy is allowed to change with each update
Clipping / trust region prevents aggressive updates and stabilizes learning massively.
Why does it exist?
For practical stability. It may not be the fastest, but it almost certainly learns well. That’s why it’s used everywhere in applied RL.
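The clipping idea fits in a few lines. This is a sketch of the clipped surrogate term for a single sample; `clip_eps=0.2` is a commonly used value and is an assumption here, as are the example numbers.

```python
import math

def ppo_clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one (state, action) sample; it is maximized.

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective a pessimistic bound on the improvement.
    return min(ratio * advantage, clipped_ratio * advantage)

# The new policy makes a good action 1.5x more likely: the gain is capped at 1.2x.
capped_gain = ppo_clipped_term(math.log(1.5), 0.0, advantage=2.0)
# The new policy makes a bad action 1.5x more likely: the full penalty is kept.
full_penalty = ppo_clipped_term(math.log(1.5), 0.0, advantage=-2.0)
```

Once the ratio leaves the interval [1 - clip_eps, 1 + clip_eps] in the direction that would improve the objective, the term stops growing, so there is no incentive to move the policy further in a single update. That is the “seatbelt.”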
DDPG / TD3: Actor–Critic for continuous deterministic control
Here the actor is deterministic:
a=μθ(s)
Critic = Q(s, a)
DDPG:
- off-policy,
- uses replay buffer,
- it is fragile and often unstable in continuous control.
TD3:
- solves DDPG problems,
- introduces “twin critics” to reduce overestimation,
- smoothing and delay for stability.
Why does it exist?
For situations where you need very precise and deterministic actions, but want high sample efficiency.
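A sketch of how TD3's “twin critics” and “smoothing” show up in the critic's training target, for a scalar action. The callables `target_actor`, `target_q1` and `target_q2` are hypothetical stand-ins for the target networks, and the noise parameters are illustrative defaults.

```python
import random

def td3_target(reward, next_state, done, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               action_low=-1.0, action_high=1.0, rng=random):
    """Critic target with target-policy smoothing and a twin-critic minimum."""
    # Smoothing: add small clipped noise to the target action.
    noise = max(-noise_clip, min(rng.gauss(0.0, noise_std), noise_clip))
    next_action = target_actor(next_state) + noise
    next_action = max(action_low, min(next_action, action_high))
    # Twin critics: take the more pessimistic estimate to fight overestimation.
    q_next = min(target_q1(next_state, next_action),
                 target_q2(next_state, next_action))
    return reward + (0.0 if done else gamma * q_next)

# Hypothetical stand-ins: a constant actor and two critics that disagree.
target = td3_target(
    reward=1.0, next_state=None, done=False,
    target_actor=lambda s: 0.5,
    target_q1=lambda s, a: 10.0,
    target_q2=lambda s, a: 8.0,
    gamma=0.9, noise_std=0.0,
)
```

The “delay” part is not shown: it simply means the actor (and the target networks) are updated less often than the critics, so the actor always learns from a more settled value estimate.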
SAC: Actor–Critic with exploratory intelligence
The actor is stochastic.
The critic = Q(s,a). The objective rewards policy entropy in addition to return, so the agent learns to be competent while maintaining healthy exploration.
The result is that the algorithm is extremely stable and sample-efficient, and can run on complex continuous control.
Why does it exist?
For situations where:
- you want very high stability,
- the environment is complex,
- you need intelligent exploration,
- and you want good results without much hassle.
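The entropy idea enters SAC through the critic's target. Below is a sketch of the commonly used twin-critic form for one transition; the function name, the `alpha` value and the example numbers are assumptions for illustration.

```python
def sac_target(reward, done, q1_next, q2_next, logp_next, alpha=0.2, gamma=0.99):
    """Soft critic target: r + gamma * (min(Q1, Q2) - alpha * log pi(a'|s')).

    q1_next, q2_next: target-critic values for an action a' sampled from the
    current policy in the next state; logp_next: log-probability of that action.
    alpha is the temperature that weights the entropy bonus.
    """
    soft_value = min(q1_next, q2_next) - alpha * logp_next
    return reward + (0.0 if done else gamma * soft_value)

# An unlikely action (log-prob -2) receives an entropy bonus of alpha * 2.
target = sac_target(reward=1.0, done=False, q1_next=5.0, q2_next=6.0,
                    logp_next=-2.0, alpha=0.5, gamma=0.9)
```

Because `-alpha * logp_next` is larger for less likely actions, states where the policy can stay uncertain look slightly more valuable, which is what keeps exploration alive.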
All these algorithms are the same fundamental Actor–Critic concept implemented with different engineering solutions. They differ in how they handle:
- stability,
- value explosion (overestimation),
- exploration,
- sample efficiency.
3.2 A2C/A3C (on-policy) vs SAC/TD3 (off-policy)
On-Policy (A2C / A3C / PPO)
The agent learns from data generated by the current policy. If the policy changes, the old experience becomes outdated.
Advantages:
- conceptually simpler,
- psychologically more stable,
- predictable behavior.
Disadvantages:
- expensive in data,
- each new policy –> new episodes,
- so low sample efficiency.
In other words, On-Policy algorithms are good for robust learning, but they “burn” a lot of episodes.
Off-Policy (SAC / TD3)
The agent can learn from:
- old experience,
- experience generated by other policies,
- replay buffer.
Advantages:
- excellent sample efficiency,
- you can reuse data much more,
- works great in expensive environments.
Disadvantages:
- more complicated to stabilize,
- requires engineering care,
- easy to introduce errors if the design is bad.
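The replay buffer is the piece that makes reuse of old experience possible. A minimal sketch, with a hypothetical class name and a plain tuple per transition:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity, seed=None):
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniformly sample transitions, regardless of which policy produced them."""
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)

buffer = ReplayBuffer(capacity=3, seed=0)
for step_index in range(5):
    # Dummy transitions; a real agent would store environment data here.
    buffer.add(step_index, 0, 1.0, step_index + 1, False)
batch = buffer.sample(2)
```

An on-policy method would throw these transitions away after one update; an off-policy method can sample each of them many times, which is where the sample efficiency comes from.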
As a clear principle, if the environment is cheap, it is easy to run millions of episodes. In this case, on-policy is the perfect choice.
If the environment is expensive, data is scarce, and runs are hard to obtain, off-policy is the savior.
Therefore:
- games / light simulations –> PPO
- robotics + serious continuous control –> SAC / TD3
3.3 Why is PPO the modern “workhorse”, and when does SAC beat it?
Why is PPO the “workhorse” of modern RL?
Because:
- starts up easily,
- rarely crashes,
- excellent stability,
- mature implementations,
- runs well “out of the box,”
- excellent support in SB3, RLlib, etc.
- is conceptually simple to understand.
It is the algorithm that labs, companies, robotics teams, and researchers choose when they want something that works without drama.
If someone asks “Where do I start?”, the answer is almost always PPO.
But when does SAC win?
SAC wins when:
- actions are continuous,
- the environment is complex,
- you want high sample efficiency,
- exploration must be intelligent,
- you want stability without excessive tuning.
In particular:
- locomotion,
- robotics,
- industrial control.
In many modern benchmarks, SAC beats PPO especially in convergence speed and consistency.
What we need to understand is that PPO is the reliable, robust car, with which you can go anywhere without drama. SAC is the intelligent sports car: faster, more efficient, but technically more sophisticated.
1. All major modern algorithms (A2C, PPO, TD3, SAC) are just Actor–Critic in different forms.
2. On-policy (A2C / PPO):
- stable,
- clean,
- but eats a lot of samples.
3. Off-policy (TD3 / SAC):
- data-efficient,
- great for expensive environments,
- but requires more engineering.
4. PPO has become the standard because it offers the best combination of simplicity and reliability.
5. SAC is becoming the favorite in serious continuous control especially in robotics and in problems where stability and sample efficiency really matter.
4. If You Want Real Robots, Start Here: Actor–Critic in Continuous Action Spaces

Smooth Continuous Control → Actor–Critic
4.1 On which Gymnasium environments does it make sense to learn Actor-Critic?
The purpose of classical environments is not to be “toys“, but to fix your correct intuitions without the cost of hardware or painful debugging.
CartPole
- discrete actions,
- learn stabilization and fast feedback,
- good for understanding Actor–Critic flow.
What you learn:
- TD learning,
- advantage,
- stable vs. oscillating behavior.
MountainCar
- sparse reward,
- very slow progress at first.
What you learn:
- importance of advantage,
- stability in exploration,
- patience and weak signals.
Acrobot
- harder than it looks,
- dynamic movement required.
What you learn:
- implicit planning policy
- importance of stable critic
- errors accumulate quickly
Pendulum (first stop for continuous control)
- continuous actions,
- fine torque control.
What you learn:
- exactly what you need for robots,
- how actor–critic behaves with continuous actions,
- huge difference from value-based.
As a principle, don’t jump straight to robots. You build your RL brain in this order:
CartPole –> Acrobot –> Pendulum –> then MuJoCo / real robots.
4.2 How do you transfer the intuition from these environments to physical robots?
These environments are not “games”, they are mental models of real control.
Direct Parallels
State –> sensors
- in Gym: position, speed, angle.
- in robot: IMU, encoders, wheel speed, force, joint position.
Actions –> motor commands
- in Gym: torque / force / angle.
- in robot: PWM / current / servo setpoint / torque.
Reward –> physical goal
- in Gym: stabilization, return to center, deviation reduction.
- in robot: stable walking, precise position, low consumption.
What we have to learn here is that:
- Actor = controller that generates the command,
- Critic = evaluator that learns how good the command was,
- stability = smooth policy,
- high noise = robot that shakes / oscillates,
- healthy advantage = stable learning.
We have to use the Gym to train our skills before risking money, time, motors and nerves. If something is unstable in the Gym, in the robot… it’s a catastrophe.
4.3 What does a minimal “skeleton” look like?
For Discrete Actor–Critic (mental model)
1. Observe the state
2. Actor produces distribution on actions
3. Choose action
4. Critic estimates the value of the state
5. Get reward + new state
6. TD-error
7. Update critic
8. Update actor
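The eight steps above can be sketched as a tabular one-step Actor–Critic. Everything here is an assumption made for illustration: the tiny corridor environment, the softmax-over-preferences actor, and the learning rates, which are toy values rather than tuned recommendations.

```python
import math
import random

N_STATES = 3          # non-terminal states 0, 1, 2; reaching position 3 ends the episode
LEFT, RIGHT = 0, 1

def step(state, action):
    """Hypothetical corridor: move left or right, reward 1.0 only on reaching the goal."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    done = next_state == N_STATES
    return next_state, (1.0 if done else 0.0), done

def softmax(prefs):
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train(episodes=2000, gamma=0.9, actor_lr=0.5, critic_lr=0.1, seed=0):
    rng = random.Random(seed)
    prefs = [[0.0, 0.0] for _ in range(N_STATES)]   # actor: action preferences per state
    values = [0.0] * N_STATES                        # critic: V(s) per state
    for _ in range(episodes):
        state = 0                                    # 1. observe the state
        for _ in range(200):                         # step cap keeps every episode finite
            probs = softmax(prefs[state])            # 2. actor -> distribution over actions
            action = rng.choices([LEFT, RIGHT], weights=probs)[0]   # 3. choose action
            v_s = values[state]                      # 4. critic estimates V(s)
            next_state, reward, done = step(state, action)          # 5. reward + new state
            v_next = 0.0 if done else values[next_state]
            delta = reward + gamma * v_next - v_s    # 6. TD-error
            values[state] += critic_lr * delta       # 7. update critic
            for a in (LEFT, RIGHT):                  # 8. update actor
                # Gradient of log softmax w.r.t. the preference of action a.
                grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
                prefs[state][a] += actor_lr * delta * grad_log_pi
            if done:
                break
            state = next_state
    return prefs, values

prefs, values = train()
policy = [softmax(p) for p in prefs]   # learned probabilities of [LEFT, RIGHT] per state
```

A deep-learning version replaces the two tables with networks and the manual updates with an optimizer, but the order of operations inside the loop stays the same.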
Continuous Actor–Critic
Key difference:
- actor produces a continuous value,
- critic works with Q or Advantage over continuous actions,
- usually: PPO → advantage, SAC / TD3 → Q(s,a).
The structure is the same. Only the representation of actions and the critic differ.
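For a stochastic continuous actor, “produces a continuous value” usually means the actor outputs the mean and standard deviation of a Gaussian. A one-dimensional sketch, with the mean and standard deviation hard-coded where a network would normally supply them:

```python
import math
import random

def gaussian_log_prob(action, mean, std):
    """Log-density of a 1-D Gaussian policy at the given action."""
    var = std * std
    return -0.5 * ((action - mean) ** 2 / var + math.log(2.0 * math.pi * var))

def sample_action(mean, std, rng=random):
    """Sample a raw action and return it together with its log-probability."""
    raw_action = rng.gauss(mean, std)
    return raw_action, gaussian_log_prob(raw_action, mean, std)

def clip_action(action, low, high):
    """Simplest way to respect actuator limits (an assumption of this sketch)."""
    return max(low, min(action, high))

rng = random.Random(0)
raw, logp = sample_action(mean=0.0, std=1.0, rng=rng)
torque = clip_action(raw, -2.0, 2.0)   # Pendulum-v1 accepts torques in [-2, 2]
```

The log-probability plays the same role as the log of the softmax probability in the discrete case. Clipping is only the simplest option for bounded actions; SAC implementations typically squash the sample with tanh instead.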
4.4 Hyperparameters that really matter
Not many, but these are critical:
Actor vs Critic learning rate
- if Actor learns too fast → becomes unstable,
- if Critic learns too fast → becomes “arrogant,”
- if Critic learns too slowly → becomes “stupid.”
As a golden rule, in most on-policy setups the Critic learns a little faster than the Actor, but not by an extreme margin.
Discount factor (γ)
- too small –> agent becomes short-sighted,
- too large –> unstable / noisy,
- usually 0.95 – 0.99 is the healthy range.
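For the discount factor γ, a quick way to get a feel for the 0.95 – 0.99 range is the effective horizon 1 / (1 - γ), a common rule of thumb for how many steps ahead the agent effectively “sees.” The helper names below are made up for this sketch.

```python
def effective_horizon(gamma):
    """Rule-of-thumb planning horizon in steps: the sum of gamma**k over all k >= 0."""
    return 1.0 / (1.0 - gamma)

def reward_weight(gamma, steps_ahead):
    """Weight the agent gives to a reward that arrives steps_ahead steps in the future."""
    return gamma ** steps_ahead

horizon_short = effective_horizon(0.95)   # about 20 steps
horizon_long = effective_horizon(0.99)    # about 100 steps
weight_short = reward_weight(0.95, 50)
weight_long = reward_weight(0.99, 50)
```

Moving from 0.95 to 0.99 looks like a small change but multiplies the horizon by five, which is why the agent becomes less short-sighted and, at the same time, noisier to train.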
PPO clipping
- controls how aggressively the policy changes,
- too wide –> explosion,
- too tight –> slow learning.
SAC temperature
- controls how much exploration the policy has,
- too large –> chaotic robot,
- too small –> rigid robot.
Hyperparameters control stability, which controls the speed of learning.
4.5 What graphs to watch?
Here is the difference between “hope it works” and “we know what we are doing.”
Value Loss
- if it fluctuates violently –> unstable critic;
- if it is too small without improvement –> overconfident critic;
Policy Loss
- it must fluctuate, but have a coherent trend;
- completely flat –> dead learning;
- chaotic –> instability;
Entropy
- gradual decrease = good;
- decreases too fast = dead exploration;
- does not decrease at all = agent does not learn;
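For a discrete policy, the entropy curve is just the average of the quantity below over a batch of states. The example distributions are made up.

```python
import math

def categorical_entropy(probs):
    """Entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Start of training: the policy is close to uniform, so entropy is near its maximum.
early = categorical_entropy([0.5, 0.5])
# Later: the policy has a clear preference but still explores a little.
later = categorical_entropy([0.9, 0.1])
# Collapsed: the policy is deterministic and entropy is zero.
collapsed = categorical_entropy([1.0, 0.0])
```

For two actions the maximum is ln 2, roughly 0.69. A healthy run moves from that value downward gradually; a jump straight to zero in the first updates is the “dead exploration” case.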
Average Reward
- should increase gradually;
- not in a violent zigzag to infinity;
Optional: TD-error
TD-error is very useful as an indicator of the critic’s “mental health.”
4.6 Real traps
Unstable critic
The actor seems crazy, but in fact the critic is the problem.
Bad reward scaling
- If it is too high –> saturation / explosion;
- If it is too low –> slow learning;
Lack of state normalization
RL loves clean data. Without normalization, the agent may learn weird behaviours.
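One common way to normalize states online is to keep running statistics and standardize each observation. The sketch below handles a single scalar feature using Welford's algorithm; the class name is hypothetical, and a real agent would keep one such tracker per observation dimension.

```python
import math

class RunningNormalizer:
    """Tracks the running mean and standard deviation of one scalar feature."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0      # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        """Welford's online update: numerically stable, one sample at a time."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Fall back to 1.0 until there are enough samples for a meaningful estimate.
        return math.sqrt(self._m2 / self.count) if self.count > 1 else 1.0

    def normalize(self, x):
        return (x - self.mean) / (self.std + self.eps)

normalizer = RunningNormalizer()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    normalizer.update(reading)
scaled = normalizer.normalize(9.0)
```

With these readings the running mean is 5 and the standard deviation is 2, so a raw reading of 9 becomes roughly 2. The same statistics must be reused at evaluation time, otherwise the policy sees inputs on a different scale than it was trained on.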
Badly set entropy
- Too much exploration –> chaotic behavior;
- Too little –> blocked policy;
Too aggressively penalized policy
Sometimes excessive clipping / regularization results in a “frozen” agent.
5. Conclusion
Actor–Critic exists because classical methods are not sufficient in real problems. Q-learning / DQN are good for discrete actions and simple environments, but they collapse in continuous control and complex dynamics. Pure policy gradient is nice in theory, but has huge variance, late feedback and unstable learning. Actor–Critic combines what each lacks: the actor directly learns the policy and can smoothly control continuous actions, and the critic provides stable local feedback, reducing variance and accelerating learning.
The Critic brings stability, but also introduces bias, which in practice means that stability comes from a balance: the Critic should be neither too weak nor too aggressive. The choice between V(s), Q(s,a) and Advantage changes the learning dynamics: V gives simple stability, Q offers precision but requires care, Advantage is the modern balance used by A2C/A3C/PPO. This is where all the “big names” come from: PPO for robust stability and “first-time” results, TD3 for stabilized deterministic control and SAC for difficult, data-efficient and extremely stable continuous control.
In other words, Actor–Critic algorithms are preferred in robotics, continuous control, industry, finance and large systems because they are practical, stable, scalable and work in real environments, not just in articles. PPO is the “reliable machine” that gets you safely to your destination; SAC is the “smart and fast machine” for hard problems. On-policy is good when data is cheap; Off-policy is vital when data is expensive.
Gymnasium is not just a playground. It is the place where you form your intuitions correctly. CartPole, Acrobot, Pendulum build your foundation to understand what stability looks like, what a healthy Critic means, what progress looks like, what chaos means and how to recognize it in graphs. When you move to real robots, you just replace: states –> sensors, actions –> motor commands, reward –> physical targets.
Basically, Actor–Critic saves time, data, nerves and frustration. Instead of “hoping it works”, you have a system that learns coherently and predictably. That’s why it’s not just theory, but the foundation of modern RL. If you understand Actor–Critic, you understand the RL that matters in real life.



