AI Robotics: Tutorials, Practical Reinforcement Learning, and Real-World Control

Soft Actor Critic (SAC) Implementation In SB3 and PyTorch for Pendulum

by Dragos Calin
in Deep RL Algorithms, PyTorch, SAC, Tools, Code & Experiment Design

Your agent may fail again and again, not because it is badly trained or because the algorithm is bad, but because Soft Actor-Critic (SAC) doesn’t behave like PPO or DDPG at all.

In this tutorial, I’ll answer the following questions and more:

  • Why does Soft Actor-Critic (SAC) use two “brains” (critics)?
  • Why does it force the agent to explore?
  • Why does SB3 (the library) hide so many things in a single line of code?
  • And most importantly: How do you know that the agent is really learning, and not just pretending?

And finally, I share with you the script to train an agent with SAC to make an inverted pendulum stand upright.

TABLE OF CONTENTS

In this tutorial, I’ll cover the following subjects:

  1. WHAT SAC REALLY IS – AND WHY IT MATTERS
  2. WHY IMPLEMENT SAC USING Stable-Baselines3?
  3. SETTING UP THE ENVIRONMENT: Gymnasium + CONTINUOUS ACTION SPACES
  4. FULL SAC IMPLEMENTATION IN SB3 – TRAIN, AND EVALUATE
  5. CRITICAL SAC HYPERPARAMETERS – WHAT THEY DO AND WHY THEY MATTER
  6. PRACTICAL EXPERIMENTS: SAC vs PPO IN THE SAME ENVIRONMENT
  7. DEBUGGING SAC: WHAT TO CHECK WHEN YOUR AGENT DOESN’T LEARN
  8. CONCLUSION

1. WHAT SAC REALLY IS – AND WHY IT MATTERS

General Diagram of the Reinforcement Learning Training Flow (Pipeline)

The basic idea behind Soft Actor-Critic (SAC) is that the algorithm learns a policy that makes good decisions while remaining “curious” enough not to get stuck on bad solutions. In other words, the agent tries to get a high reward and to act in a sufficiently varied way so as not to repeat the same mistake over and over again.

In SAC, a critic is a neural network that estimates how good an action is in a given state (its Q-value). In other words, the critic is the teacher who gives grades.

SAC uses two critics. A single critic is often too optimistic, meaning it estimates the expected return as higher than it really is. So two critics are used to give two estimates, and SAC uses the lower one to reduce false optimism. This makes learning more stable and safer.

The actor says: “What action should I do now?”

The critics check: “Was it a good or bad choice?”

The actor sends actions –> the critics evaluate them –> the actor adjusts to do actions that the critics consider good.

It’s like a child (the actor) trying something, and two teachers (the critics) tell him how well he did.
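The “lower of two grades” idea can be sketched as the target value that both critics are trained toward. This is a minimal scalar sketch of the usual SAC target, not SB3's internal code; the function name `soft_q_target` and the sample numbers are my own illustration.

```python
def soft_q_target(reward, done, q1_next, q2_next, log_prob_next,
                  gamma=0.99, alpha=0.2):
    """Target for both critics: r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    # Take the more pessimistic of the two target critics to curb overestimation.
    min_q_next = min(q1_next, q2_next)
    # Entropy bonus: -alpha * log_prob is larger when the next action was less likely.
    soft_value = min_q_next - alpha * log_prob_next
    # No bootstrapping past a terminal state.
    return reward + (0.0 if done else gamma * soft_value)


# One critic says -10, the other -12: the target is built from -12.
print(soft_q_target(reward=-1.0, done=False, q1_next=-10.0, q2_next=-12.0,
                    log_prob_next=-0.5))
```

Here `alpha` plays the role of what SB3 calls `ent_coef`; the value 0.2 is only a placeholder.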

1.1 Maximum Entropy

One of the essential parts of the algorithm is Maximum Entropy.

Maximum entropy means that the agent:

  • doesn’t just want a big reward,
  • also wants to be unpredictable enough (i.e. explore well).

Entropy supports stable exploration in continuous action spaces.

For example, turning a motor:

  • without entropy, the agent can get stuck doing the same move over and over,
  • with entropy, the agent tries slightly different moves, but not completely random.

Basically, the agent explores “with its head”, not chaotically.
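Written as a formula, the maximum-entropy idea adds an entropy term to the usual expected return. This is the standard form of the objective as I know it from the SAC literature, with α as the temperature (the quantity SB3 exposes as `ent_coef`):

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
\left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ \log \pi(a \mid s) \right]
```

With α = 0 this reduces to plain reward maximization; a larger α pays the agent more for keeping its action distribution wide.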

In continuous control, SAC is often preferred over PPO or DDPG.

SAC is preferred because:

  • it is more stable than DDPG,
  • it is more sample-efficient than PPO (PPO may be more stable in discrete or simple environments),
  • it has controlled exploration,
  • it can learn well with less data,
  • it copes well with noisy environments.

In short, SAC combines PPO-like stability with DDPG-like off-policy learning, while fixing DDPG’s stability issues.

SAC is very good in real problems where:

  • the action is a continuous value (not a left-right button),
  • you have motors, wheels, robotic arms, drones, manipulation,
  • you need to control forces, speeds, angles,
  • the reward is noisy or hard to intuit.

Most real robotic applications fall into this category.

The conceptual difference between SAC (off-policy) and PPO (on-policy) is that:

SAC (off-policy)

  • It can learn from old data stored in a buffer,
  • It can use the same experience many times.

PPO (on-policy)

  • It learns only from data collected by the current policy,
  • It discards that data after each policy update.

In conclusion, SAC recycles data. PPO must always collect new data.
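To make “recycling data” concrete, here is a minimal replay buffer sketch. It is a simplified stand-in for what an off-policy algorithm keeps in memory, not SB3's `ReplayBuffer` class; the class name, the tuple layout and the dummy transitions are assumptions for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (obs, action, reward, next_obs, done) transitions."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest transition when full.
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement: old and new data are mixed.
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=1000)
for step in range(1500):
    buffer.add((step, 0.0, -1.0, step + 1, False))  # dummy transition

batch = buffer.sample(32)
print(len(buffer), len(batch))
```

An on-policy method like PPO has no such long-lived store: its rollout data is thrown away once the policy has been updated.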

1.2 Sample-Efficient

Another important concept in SAC is sample efficiency. It tells us how well the agent manages to learn from little data (few timesteps).

SAC is considered sample-efficient because:

  • it uses a replay buffer,
  • it learns many times from the same experiences,
  • it does not discard data immediately.

Basically, SAC usually needs fewer timesteps to achieve the same performance.

Sample-efficiency is measured by:

  • how many timesteps are needed for the agent to reach a certain level of reward,
  • how fast the learning curve increases.

For example:

If SAC reaches a reward of -200 on Pendulum-v1 (Gymnasium) in 30k timesteps, and PPO needs 70k, SAC is more sample-efficient.
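The “timesteps to reach a reward level” measure can be computed directly from a logged learning curve. A minimal sketch follows; the helper name and the two curves are made up to mirror the 30k vs 70k example, they are not measured results.

```python
def timesteps_to_threshold(timesteps, mean_rewards, threshold):
    """Return the first logged timestep whose mean reward reaches `threshold`, or None."""
    for t, r in zip(timesteps, mean_rewards):
        if r >= threshold:
            return t
    return None


# Hypothetical learning curves (ep_rew_mean logged every 10k steps).
steps = [10_000, 20_000, 30_000, 40_000, 50_000, 60_000, 70_000]
sac_curve = [-1200, -600, -190, -170, -160, -155, -150]
ppo_curve = [-1300, -1250, -1100, -900, -600, -350, -195]

print(timesteps_to_threshold(steps, sac_curve, -200))  # 30000
print(timesteps_to_threshold(steps, ppo_curve, -200))  # 70000
```

In practice you would average this over several seeds, because a single run can cross the threshold early by luck.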


2. WHY IMPLEMENT SAC USING Stable-Baselines3?

Implement SAC From Scratch vs Using Stable-Baselines3

There are many reasons to implement SAC using Stable-Baselines3 (SB3) and not from scratch. SB3 gives us everything already built and tested.

If we write the entire SAC training pipeline from scratch, we have to implement all of this ourselves:

  • two critics,
  • the actor,
  • soft updates,
  • replay buffer,
  • target networks,
  • entropy tuning,
  • gradient steps.

In SB3, all of these are already implemented, tested and optimized.

More than this, SB3 automatically manages:

  • actor,
  • critics (Q1 and Q2),
  • critic targets,
  • soft target update (τ),
  • automatic tuning of ent_coef,
  • replay buffer,
  • PyTorch optimizations,
  • gradient steps for actor and critics,
  • logging and statistics.

SB3 is friendly to beginners because:

  • the default hyperparameter values are already tested in many environments,
  • it often works on standard benchmarks without any tuning,
  • it offers reproducibility through clear seed settings,
  • it has stable, well-tested implementations,
  • it takes care of details that are easy to get wrong (soft updates, target networks, etc.).

The 5-step SB3 training flow is:

  1. Create env (e.g., gym.make("Pendulum-v1")),
  2. Create a model (e.g., model = SAC("MlpPolicy", env)),
  3. Call .learn(total_timesteps=100000),
  4. SB3 collects rollouts, stores in replay buffer, samples batches at train_freq (default 1), updates actor/critics/entropy,
  5. Save (.save()) and load (.load()) for evaluation/demo.

3. SETTING UP THE ENVIRONMENT: Gymnasium + CONTINUOUS ACTION SPACES

Setting Up the Pendulum-v1 Environment

In an environment with continuous action space, the action is not a button (left/right). It is a real number.

The agent controls continuous values, like a real motor.

The SB3 implementation of SAC works only with continuous actions. So, before starting training, it is worth checking whether an environment is compatible with SAC.

The check is done simply, using:

env.action_space
  • If it is Box(low, high) –> compatible with SAC,
  • If it is Discrete(n) –> NOT compatible.

3.1 Why is Pendulum-v1 an ideal example to start with?

Because:

  • it has continuous action (a motor),
  • it is simple, but difficult enough for RL to be interesting,
  • the reward is clear: keep the pendulum upright,
  • it is fast to train,
  • it is used in many tutorials and RL benchmarks.

It is exactly what you need to understand SAC without complications.

3.2 How does it work and what does the agent have to control in Pendulum?

The agent controls the torque of the motor.

Its objective:

  • to raise the pendulum to the vertical position,
  • to stabilize it there,
  • using as little energy as possible.

The pendulum is like a short arm that wants to stay upright without falling.

3.3 What are the dimensions of the action and observation and why does this matter?

In Pendulum-v1:

  • action space: Box([-2], [2]) –> 1 continuous action,
  • observation space: 3 values (cos(θ), sin(θ), angular_velocity).

Why does it matter?

  • the actor must have exactly 1 output,
  • the critics must evaluate a vector of 3 + 1 values,
  • the neural networks must be configured correctly.

If dimensions don’t match, the neural networks cannot compute valid actions or Q-values.

3.4 Normalizing actions and how Gymnasium does it

For stability:

  • large actions –> can produce unstable dynamics,
  • too small actions –> the agent cannot control the system.

Gymnasium itself does not normalize or rescale actions; Pendulum-v1 simply clips the torque to its bounds. The rescaling happens inside SB3: the SAC actor squashes its output to [-1, 1] with tanh (squash_output=True), and SB3 maps that value to the environment’s action bounds before calling step():

action = env.action_space.low + (scaled_action + 1)/2 * (env.action_space.high - env.action_space.low)

So the policy always works in [-1, 1], while the environment receives torques in [-2, 2].
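A worked example of the rescaling formula above, as a minimal sketch with plain floats. The function name `unscale_action` is my own choice for this illustration.

```python
def unscale_action(scaled_action, low, high):
    """Map an action from [-1, 1] to [low, high] with a linear rescale."""
    return low + 0.5 * (scaled_action + 1.0) * (high - low)


# Pendulum-v1 torque bounds are [-2, 2].
for a in (-1.0, 0.0, 0.5, 1.0):
    print(a, "->", unscale_action(a, low=-2.0, high=2.0))
```

For a symmetric range like Pendulum's, the mapping is just a multiplication by 2; the general formula matters when the bounds are not symmetric around zero.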

3.5 Minimal reward shaping for a continuous environment

Minimal reward shaping means:

  • do not modify the reward,
  • do not add artificial bonuses,
  • keep the original reward of the environment.

In Pendulum-v1, the built-in reward already:

  • penalizes the wrong angle,
  • penalizes high angular velocity,
  • penalizes energy consumption.

Without additional shaping, training stays faithful to the original task and is easier to reproduce.
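The three penalties can be written out explicitly. The sketch below follows the Pendulum-v1 reward formula as I understand it from the Gymnasium documentation (angle² + 0.1·velocity² + 0.001·torque², negated); treat the coefficients as something to confirm against the docs for your installed version.

```python
import math


def angle_normalize(theta):
    """Wrap an angle to [-pi, pi), so 0 always means 'upright'."""
    return ((theta + math.pi) % (2 * math.pi)) - math.pi


def pendulum_reward(theta, theta_dot, torque):
    """Negative cost: angle error + angular velocity + torque effort."""
    cost = angle_normalize(theta) ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2
    return -cost


print(pendulum_reward(0.0, 0.0, 0.0))      # upright, still, no torque
print(pendulum_reward(math.pi, 8.0, 2.0))  # hanging down, max speed, max torque
```

The best possible reward per step is 0 and the worst is roughly -16.3, which is why episode returns over 200 steps are always negative.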


4. FULL SAC IMPLEMENTATION IN SB3 – TRAIN, AND EVALUATE

In this chapter we train the SAC agent. The minimum steps to train and demo a SAC agent in SB3 are:

1) Create the environment

env = make_env(render_mode=None)

2) Create the SAC model

model = SAC("MlpPolicy", env, learning_rate=lr, ...)

3) Call .learn()

model.learn(total_timesteps=timesteps, progress_bar=True)

4) Save the model

model.save(model_path)

5) Load model and run evaluation/demo

model = SAC.load(model_path)

That’s it. SB3 does the rest automatically.

4.1 Interpreting console output during training

Console Output During The Training

The image above shows part of the console output during training, with the values logged towards the end of the run.

Complete interpretation of each parameter in the console:

ROLL OUT SECTION (about agent behavior)

  • ep_len_mean = 200: means that an episode lasts on average 200 timesteps. In Pendulum-v1, the episode has exactly 200 timesteps, so it’s normal.
  • ep_rew_mean = -150: the average reward per episode is approximately -150. For Pendulum:
    • reward starts at ~ -1500,
    • a good agent reaches between -200 and -100.

TIME SECTION (time statistics)

  • episodes = 992: in 198k timesteps, SAC has run almost 1000 episodes.
  • fps = 93: the model runs at 93 timesteps per second. It’s very good for SAC + Pendulum.
  • time_elapsed = 2114 sec: ran for about 35 minutes.
  • total_timesteps = 198400: the training has almost 200,000 timesteps.

TRAIN SECTION (about neural networks)

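  • actor_loss = 15.9: the absolute value of the actor loss says little by itself; a large or small actor loss does NOT mean good or bad.
  • critic_loss = 0.344: the critics estimate Q-values, and this is their error.
    • critic_loss small (0.3) = very good
    • critic_loss large (> 10) = unstable
  • ent_coef = 0.0214: the entropy coefficient (temperature) controls exploration.
  • ent_coef_loss = -0.03: the loss used to tune ent_coef automatically; a value near zero usually means the policy entropy is close to its target.
  • learning_rate = 0.0003: it is the value of learning rate (lr) chosen by the user. SB3 displays it for information only.
  • n_updates = 198299: Something important:
    • SAC does one update per environment step (train_freq = 1, gradient_steps = 1),
    • so almost every timestep = an update; the small gap comes from the warm-up phase (learning_starts, 100 steps by default in SB3),
    • If n_updates ≈ timesteps –> the update schedule is working as configured.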
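The rollout and time numbers in the log can be cross-checked with simple arithmetic. This short sketch only reuses the values quoted above.

```python
total_timesteps = 198_400
episode_length = 200      # Pendulum-v1 ends every episode after 200 steps
time_elapsed = 2_114      # seconds

episodes = total_timesteps / episode_length
fps = int(total_timesteps / time_elapsed)
minutes = round(time_elapsed / 60, 1)

print(episodes, fps, minutes)
```

If the numbers in your own log do not line up like this (for example, far fewer episodes than timesteps / 200), the environment or wrapper setup deserves a second look.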

5. CRITICAL SAC HYPERPARAMETERS – WHAT THEY DO AND WHY THEY MATTER

Critical SAC Hyperparameters

5.1 What does learning_rate do in the context of actor and critics?

Learning rate controls how fast we update the actor and critic networks.

  • if it is too high –> the actor and critics overshoot good solutions –> unstable learning,
  • if it is too low –> learning becomes slow and we need many more timesteps.

I explained how to choose learning_rate in another tutorial. You can read the tutorial here: The Complete Guide of Learning Rate in RL

In SB3’s SAC, the actor and critics share the same learning_rate, so a bad choice affects the entire agent.

5.2 What role does buffer_size play in sample efficiency?

The replay buffer is the reason why SAC is sample-efficient.

  • a large buffer (e.g. 100k–1M) –> the agent can learn from many old experiences,
  • a small buffer –> the agent only learns from the most recent transitions –> less stable.

With a larger buffer_size, the agent generally:

  • recycles data,
  • has a more accurate estimate of Q-values,
  • becomes more stable in continuous environments.

5.3 Why does batch_size influence stability in continuous control?

Batch size = how many transitions are used in a single update.

  • small batch (32–64) –> unstable updates, noisy Q-values,
  • large batch (256–512) –> more stable estimates, smoother learning.

In environments with continuous actions (Pendulum, MuJoCo, robotics):

➡ large batch_size = stable critics + stable policy

That is why SAC, by default, uses batch_size = 256 (PPO in SB3 defaults to a smaller mini-batch of 64).
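Why a larger batch gives smoother updates can be shown with a toy experiment: the average of a big batch of noisy values jumps around less than the average of a small one. This is a generic statistics sketch, not SAC itself; the Gaussian “learning signal” is a stand-in I chose for illustration.

```python
import random
import statistics

random.seed(0)

# Stand-in for noisy per-transition learning signals (e.g. TD errors).
population = [random.gauss(0.0, 1.0) for _ in range(50_000)]


def spread_of_batch_means(batch_size, n_batches=500):
    """Standard deviation of the batch mean across many sampled batches."""
    means = [statistics.fmean(random.sample(population, batch_size))
             for _ in range(n_batches)]
    return statistics.stdev(means)


small = spread_of_batch_means(32)
large = spread_of_batch_means(256)
print(f"batch 32: {small:.3f}  batch 256: {large:.3f}")
```

The spread shrinks roughly with the square root of the batch size, so going from 32 to 256 makes the averaged signal close to three times less noisy.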

5.4 What is ent_coef and how does entropy self-adjust in SB3?

ent_coef controls how much the agent explores. In SAC, we have “Maximum Entropy RL”, so entropy is part of the objective.

In SB3:

ent_coef = "auto"

This means:

  • the agent learns by itself how much exploration it needs,
  • the entropy is high at the beginning (exploration),
  • it naturally decreases when the policy becomes confident.

It is usually more robust than a fixed coefficient.
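How the coefficient adjusts itself can be sketched as one gradient step on the entropy-coefficient loss. This is a minimal scalar version of the usual SAC temperature update, with the common heuristic target entropy of -dim(action); it is my own simplification with plain gradient descent, not a copy of SB3's code (which uses an optimizer on tensors).

```python
import math


def update_log_ent_coef(log_ent_coef, log_probs, target_entropy, lr=3e-4):
    """One gradient-descent step on loss = -log_alpha * mean(log_prob + target_entropy)."""
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -mean_term                      # d(loss) / d(log_alpha)
    return log_ent_coef - lr * grad


action_dim = 1                             # Pendulum-v1 has one torque dimension
target_entropy = -float(action_dim)        # common heuristic: -dim(action space)

log_alpha = math.log(1.0)
# Policy still very random: entropy (= -mean log_prob) is above the target.
log_alpha_after = update_log_ent_coef(log_alpha, [-0.2, 0.1, -0.3], target_entropy)
print(math.exp(log_alpha), "->", math.exp(log_alpha_after))
```

When the policy's entropy is above the target, the coefficient shrinks a little; when the policy becomes too deterministic, the same rule pushes it back up.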

5.5 How does tau influence the updating of target networks?

tau controls the soft update of target critics:

target_param = tau * online_param + (1 - tau) * target_param
  • high tau (0.01–0.02) –> fast updates, but agents can become unstable,
  • low tau (0.005 — the default value) –> slow and stable updates.

Target critics are the stability “anchor” of SAC.
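The soft update formula above, applied repeatedly, looks like this. A minimal sketch on plain Python lists instead of network parameters; the numbers are arbitrary.

```python
def soft_update(online_params, target_params, tau=0.005):
    """Polyak averaging: move each target parameter a fraction tau toward the online one."""
    return [tau * o + (1.0 - tau) * t for o, t in zip(online_params, target_params)]


online = [1.0, -2.0]
target = [0.0, 0.0]
for _ in range(200):
    target = soft_update(online, target)

print(target)
```

With tau = 0.005, the target has covered only about 63% of the distance to the online values after 200 updates, which is exactly the slow, anchored behavior the target critics are meant to have.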

5.6 Why can gamma affect stability in environments with slow dynamics?

Gamma = discount factor.

I explained in another tutorial how to choose discount factor and what exactly it influences. You can read the tutorial here: Discount Factor (gamma) Explained With Q-Learning + CartPole

In slow environments (pendulum, robotic arms):

  • high gamma (0.99–0.999) –> agent makes long-term, stable decisions,
  • low gamma (0.90–0.95) –> agent only sees immediate effects –> unstable, learns chaotically.

In continuous control, high gamma is standard because dynamics can last hundreds of timesteps.

➡ low gamma = critics cannot estimate Q-values correctly.
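A quick way to get a feel for gamma is to convert it into a horizon. The sketch below uses the common rule of thumb 1 / (1 - gamma) plus the number of steps after which a reward's weight drops to one half; both helper names are mine.

```python
import math


def effective_horizon(gamma):
    """Rough number of future steps that matter: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


def steps_until_weight(gamma, weight=0.5):
    """Steps k after which a reward is discounted to `weight` (gamma**k = weight)."""
    return math.log(weight) / math.log(gamma)


for gamma in (0.90, 0.95, 0.99, 0.999):
    print(gamma, round(effective_horizon(gamma)), round(steps_until_weight(gamma)))
```

A Pendulum-v1 episode lasts 200 steps, so a gamma of 0.90 (horizon of about 10 steps) makes the agent nearly blind to the payoff of a full swing-up.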

5.7 What is the importance of train_freq and gradient_steps in SAC?

In SAC:

  • train_freq = how often we update,
  • gradient_steps = how many updates we make at each train step.

Default:

train_freq = 1
gradient_steps = 1

This means that:

  • at each step –> we extract a batch –> we do an update,
  • n_updates ≈ total_timesteps (as you noticed in the console)

If you want to learn more from each collected step:

  • higher gradient_steps = more updates per environment step (more compute, more data reuse),
  • train_freq = 1 is a good default in most cases for SAC.
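The relationship between these two settings and the n_updates counter can be approximated with a small helper. This is my own estimate for step-based train_freq, assuming SB3's default warm-up of learning_starts = 100; the real counter can differ by an update or two.

```python
def expected_updates(total_timesteps, train_freq=1, gradient_steps=1, learning_starts=100):
    """Approximate number of gradient updates for step-based train_freq."""
    training_steps = max(total_timesteps - learning_starts, 0)
    return (training_steps // train_freq) * gradient_steps


print(expected_updates(198_400))                                  # defaults
print(expected_updates(198_400, gradient_steps=4))                # 4 updates per env step
print(expected_updates(198_400, train_freq=8, gradient_steps=8))  # batched updates
```

The first value lands within one update of the 198299 shown in the console section, and the last line shows that updating 8 times every 8 steps keeps the total work about the same while changing when it happens.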

5.8 How do you choose hyperparameter values for toy tasks vs robotics tasks?

5.8.1 Toy tasks (Pendulum, MountainCarContinuous, LunarLanderContinuous)

Purpose: fast learning.

Recommendations:

  • lr: 3e-4
  • buffer_size: 100k
  • batch_size: 256
  • ent_coef: auto
  • tau: 0.005
  • gamma: 0.99
  • timesteps: 50k–200k

Reason: the environments are simple and do not require massive data.

5.8.2 Robotics tasks (MuJoCo, Isaac, Unity, robotic arms, mobile robots)

Goal: maximum stability + smooth behavior.

Recommendations:

  • lr: 1e-4
  • buffer_size: 500k–2M
  • batch_size: 512
  • ent_coef: auto
  • tau: 0.005 (never higher)
  • gamma: 0.995–0.999
  • timesteps: 1M–20M

Reason: complex dynamics + noise + need for stable Q-values.
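The two recommendation lists can be kept as ready-to-use keyword arguments. A minimal sketch: the dictionary names are my own, and where the text gives a range I simply picked one value inside it.

```python
# Keyword arguments for stable_baselines3.SAC, taken from the two lists above.
TOY_TASK_KWARGS = dict(
    learning_rate=3e-4,
    buffer_size=100_000,
    batch_size=256,
    ent_coef="auto",
    tau=0.005,
    gamma=0.99,
)

# Where the text gives a range, one value inside it is picked here (an assumption).
ROBOTICS_KWARGS = dict(
    learning_rate=1e-4,
    buffer_size=1_000_000,   # from the 500k-2M range
    batch_size=512,
    ent_coef="auto",
    tau=0.005,
    gamma=0.995,             # from the 0.995-0.999 range
)

# Intended use (requires stable-baselines3 and an environment `env`):
#   model = SAC("MlpPolicy", env, **TOY_TASK_KWARGS)
for name in sorted(TOY_TASK_KWARGS):
    print(name, TOY_TASK_KWARGS[name], ROBOTICS_KWARGS[name])
```

Keeping the presets in one place makes it easy to log exactly which configuration produced which learning curve.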


6. PRACTICAL EXPERIMENTS: SAC vs PPO IN THE SAME ENVIRONMENT

In continuous control, actions are real numbers (e.g. torque = 0.58).

They require a policy with the finesse to make small, precise and stable adjustments.

SAC offers exactly that: soft Q-learning + maximum entropy + off-policy learning.

Visible Differences Between SAC and PPO Learning Curves

SAC graph

  • starts at ~ -1400,
  • rapidly reaches ~ -300 in 20k timesteps,
  • then stabilizes at ~ -200 (near optimal),
  • very smooth curve, without major oscillations.

PPO graph

  • starts at ~ -1200,
  • deteriorates to ~ -1300,
  • then slowly climbs to ~ -500 as it approaches ~200k timesteps.

The obvious result in this experiment:

  • SAC learns 5–10× faster,
  • SAC is stable,
  • PPO is slow, oscillating and reaches a worse performance.

SAC quickly climbs to near-optimal performance and stabilizes, while PPO uses 10× more timesteps and still doesn’t reach the same performance.

6.1 Why does SAC converge faster in most continuous environments?

Because:

  • it is off-policy –> reuses old data,
  • has two critics –> more stable Q estimates,
  • uses maximum entropy –> explores intelligently,
  • uses soft updates –> target values change slowly, so training stays stable,
  • can simultaneously learn the actor and critics on the same batch.

SAC combines stability with sample efficiency, which PPO struggles to match in this kind of task.

6.2 Practical lessons you can apply immediately

  • if you have continuous actions –> use SAC as your first option,
  • if the reward is noisy –> SAC is usually more stable,
  • if you have few timesteps available –> SAC is usually the better choice (sample-efficient),
  • if you want a robotic prototype –> SAC + SB3 is a strong combination,
  • if PPO struggles to learn –> try SAC next.

6.3 Python script to train SAC with SB3 in Gymnasium

Before showing the complete SAC training script, it is important to understand what this small piece of code actually does.

Stable-Baselines3 hides all the complexity of SAC behind a clean and simple API:

  • SAC automatically builds the actor, both critics, and the target critics
  • it manages the replay buffer, the entropy tuning and the soft updates
  • it samples batches, updates the networks, and keeps everything stable

With just a few lines of code you get a fully working SAC agent that can train, save, load and run a demo.

Below is the exact training script I used for Pendulum-v1.


"""
SAC Training and Demo Script for Pendulum-v1
--------------------------------------------
Train a SAC agent (no rendering) or run a graphical demo.

USAGE:

# Train SAC
python sac_pendulum.py --train --timesteps 200000

# Demo
python sac_pendulum.py --demo --model <path>

Author: Calin Dragos George
Updated: 23 November 2025
"""

import argparse
import os
import random
import time
import gymnasium as gym
import numpy as np
import torch

from stable_baselines3 import SAC
from stable_baselines3.common.monitor import Monitor


# ---------------------------------------------------------
# Set seeds
# ---------------------------------------------------------
def set_seed(seed):
    # Seed every random number generator the script may touch
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


# ---------------------------------------------------------
# Create Pendulum Environment
# ---------------------------------------------------------
def make_env(render_mode=None):
    # render_mode=None → training (no graphics)
    # render_mode="human" → demo (graphics)
    env = gym.make("Pendulum-v1", render_mode=render_mode)
    env = Monitor(env)
    return env


# ---------------------------------------------------------
# Create SAC Model for Pendulum
# ---------------------------------------------------------
def create_sac(env, lr, log_dir, seed=None):

    model = SAC(
        "MlpPolicy",
        env,
        learning_rate=lr,
        buffer_size=100000,
        batch_size=256,
        gamma=0.99,
        tau=0.005,
        train_freq=1,
        gradient_steps=1,
        ent_coef="auto",
        verbose=1,
        tensorboard_log=log_dir,
        seed=seed,  # SB3 also seeds the environment and the action space
    )

    return model


# ---------------------------------------------------------
# Training Function
# ---------------------------------------------------------
def train_sac(lr, timesteps, seed):

    set_seed(seed)
    env = make_env(render_mode=None)

    timestamp = time.strftime("%Y%m%d-%H%M%S")
    log_dir = f"logs/SAC_Pendulum_lr{lr}_seed{seed}_{timestamp}"
    os.makedirs(log_dir, exist_ok=True)

    print("\n Training SAC on Pendulum-v1")
    print(f"→ Learning Rate: {lr}")
    print(f"→ Seed: {seed}")
    print(f"→ Timesteps: {timesteps}")
    print(f"→ Logs: {log_dir}\n")

    # Pass the seed to SB3 so the run is reproducible end to end
    model = create_sac(env, lr, log_dir, seed=seed)
    model.learn(total_timesteps=timesteps, progress_bar=True)

    model_path = os.path.join(log_dir, f"SAC_Pendulum_lr{lr}_seed{seed}.zip")
    model.save(model_path)

    print(f"\n Model saved to: {model_path}\n")
    env.close()


# ---------------------------------------------------------
# Demo Function (with graphics)
# ---------------------------------------------------------
def run_demo(model_path, episodes):

    if not os.path.exists(model_path):
        print(f"\n Model not found: {model_path}\n")
        return

    print(f"\n Running SAC Demo: {model_path}\n")

    env = make_env(render_mode="human")
    model = SAC.load(model_path)

    for ep in range(episodes):
        obs, _ = env.reset()
        done = False
        ep_reward = 0

        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_reward += reward
            done = terminated or truncated

        print(f"Episode {ep + 1} Reward = {ep_reward}")

    env.close()


# ---------------------------------------------------------
# CLI
# ---------------------------------------------------------
if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument("--train", action="store_true")
    parser.add_argument("--demo", action="store_true")

    parser.add_argument("--lr", type=float, default=3e-4)
    parser.add_argument("--timesteps", type=int, default=200000)
    parser.add_argument("--seed", type=int, default=1)

    parser.add_argument("--model", type=str, default=None)
    parser.add_argument("--episodes", type=int, default=10)

    args = parser.parse_args()

    if args.train:
        train_sac(args.lr, args.timesteps, args.seed)

    elif args.demo:
        if args.model is None:
            print("\n ERROR: missing --model path\n")
        else:
            run_demo(args.model, args.episodes)

    else:
        print("\n Please specify --train or --demo.\n")

SAC Demo Pendulum-v1


7. DEBUGGING SAC: WHAT TO CHECK WHEN YOUR AGENT DOESN’T LEARN

7.1 How do you know if the entropy is too high or too low?

If the entropy is too high:

  • ent_coef is high (e.g. > 0.2),
  • actions seem completely random,
  • reward does not increase at all,
  • actor never becomes confident about its actions.

If the entropy is too low:

  • ent_coef drops almost to 0 immediately,
  • agent stops exploring,
  • it just learns a bad behavior and repeats it,
  • reward gets stuck.

Simple rule:

  • At the beginning –> high entropy,
  • After learning –> low entropy.

7.2 How do you detect if actions are saturated (clipping)?

Check if actions are always at the bounds (-1 or +1 in the policy’s normalized range, -2 or +2 as torque in Pendulum-v1).

Signs:

  • when you print the actions, you see the same extreme value again and again, e.g. [2.0] [2.0] [2.0],
  • the pendulum moves violently or only in one direction,
  • the reward oscillates or decreases.

Common causes:

  • learning rate too high,
  • unstable critics,
  • entropy too high,
  • wrong normalization of actions.
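Instead of eyeballing printed actions, you can measure how often they hit the bounds. A minimal sketch, assuming you have collected scalar torques (for example from model.predict() during a demo) into a list; the function name, the tolerance and the sample logs are hypothetical.

```python
def saturation_fraction(actions, low=-2.0, high=2.0, tol=1e-3):
    """Fraction of scalar actions that sit at (or within tol of) either bound."""
    at_bound = [a for a in actions if a <= low + tol or a >= high - tol]
    return len(at_bound) / len(actions)


# Hypothetical torque logs collected during a demo.
healthy = [0.3, -1.2, 0.05, 1.7, -0.4, 0.0, 0.8, -0.1]
saturated = [2.0, 2.0, -2.0, 2.0, 1.9995, 2.0, -2.0, 2.0]

print(saturation_fraction(healthy))
print(saturation_fraction(saturated))
```

Some saturation is normal during the swing-up phase in Pendulum; a fraction that stays close to 1.0 for whole episodes is the warning sign.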

7.3 How do you check if the environment reward destroys learning?

Clear signs:

  • the reward remains almost constant (e.g. around -1300 in Pendulum),
  • the reward gives no “hints” about the learning direction (for example, it is flat or very sparse),
  • a small change in policy produces a very variable reward.

Quick test:

import gymnasium as gym

env = gym.make("Pendulum-v1")
obs, _ = env.reset()
for _ in range(10):
    action = env.action_space.sample()  # random action
    print(env.step(action)[1])  # print reward

If the rewards are completely chaotic –> problematic environment.

7.4 What are the signs that critics are not learning (overestimation / underestimation)?

Critic overestimation (Q-values too high):

  • critic_loss continuously increases (> 5–10)
  • actor becomes unstable
  • entropy increases instead of decreasing
  • reward oscillates violently

Critic underestimation (Q-values too low):

  • actor learns very slowly
  • actions are almost zero
  • reward increases very slowly (e.g. -1300 –> -1200 –> -1180…)

Numerical indicator:

  • critic_loss should be between 0.1 and 2.0 on Pendulum.

7.5 What problems arise due to too small a buffer?

If buffer_size is too small:

  • the agent uses only recent data –> unstable learning,
  • critics “forget” important experiences,
  • reward goes up, down, up again –> “zig-zag learning,”
  • actor learns repeated bad behaviors.

Minimum recommendation for SAC:

  • toy tasks: 100k,
  • robotics: 500k – 2M.

7.6 What bugs occur when you wrongly normalize observations or actions?

Wrong normalization of observations:

  • state becomes NaN or Inf,
  • critic_loss explodes,
  • actor produces random actions,
  • reward stagnates.

Wrong normalization of actions:

  • actions > 1 or < -1 –> clipping –> violent behavior,
  • agent cannot control fine dynamics,
  • actor_loss grows chaotically.

Quick test:

print(obs.min(), obs.max())

If you see incredibly high values –> wrong normalization.
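The min/max print can be extended into a small check that also catches NaN and Inf. A minimal sketch; the helper name and the max_abs threshold of 100 are arbitrary choices for illustration, so pick a limit that fits your own observation space.

```python
import math


def check_observation(obs, max_abs=100.0):
    """Return a list of problems found in one observation vector."""
    problems = []
    if any(math.isnan(x) for x in obs):
        problems.append("contains NaN")
    if any(math.isinf(x) for x in obs):
        problems.append("contains Inf")
    finite = [abs(x) for x in obs if math.isfinite(x)]
    if finite and max(finite) > max_abs:
        problems.append(f"magnitude above {max_abs}")
    return problems


print(check_observation([0.98, -0.17, 1.5]))         # typical Pendulum-v1 observation
print(check_observation([0.98, float("nan"), 4e6]))  # broken normalization
```

Running a check like this on the first few hundred observations is cheap and usually finds a broken wrapper before any training time is wasted.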

7.7 How do you quickly check if SAC works on another environment?

Simple trick: run SAC on Pendulum-v1.

  • if it learns –> your implementation is correct,
  • if it doesn’t learn –> the problem is with the code, not the environment.

8. CONCLUSION

If there is one thing you should remember from this entire tutorial, it is this:

➡ SAC does not behave like PPO or DDPG, and your agent often fails precisely when you treat it as if it does.

SAC learns differently:

  • it uses two critics instead of one,
  • entropy instead of pure reward maximization,
  • and soft target updates instead of aggressive value fitting.

SAC explores differently:

  • it keeps exploration inside the objective itself,
  • not as an add-on, not as noise,
  • but as a fundamental part of how the agent thinks.

SAC trains differently:

  • it is off-policy,
  • reuses old experiences many times,
  • and reaches good solutions with a fraction of the data PPO needs.

SAC behaves differently:

  • its learning curve is smoother,
  • its Q-values are more stable,
  • and its policy becomes confident at the exact moment entropy decides it should.

And the most important point:

➡ SAC is one of the most practical algorithms for continuous control – from simple environments like Pendulum to complex robotic systems with real sensors, noise, delays and imperfect physics.

You now understand:

  • why SAC uses two critics,
  • why entropy is essential,
  • why SB3 hides so much behind one line of code,
  • how to interpret SAC’s training logs correctly,
  • which hyperparameters truly matter,
  • why SAC often outperforms PPO in continuous actions,
  • how to debug SAC when it fails.

And with the training script you now have, you can:

  • train
  • save
  • load
  • visualize
  • and deploy

a complete SAC agent on Pendulum-v1, or on any other continuous-control environment.

Tags: Evaluation Metrics, Replay Buffer, TensorBoard

About the author

About Dragos Calin

Dragos Calin is a robotics engineer and reinforcement learning practitioner focused on building real-world autonomous and remote-controlled robotics for agriculture, edge-AI robotics, and embedded platforms. His work joins simulation, machine learning, and hardware deployment, with a strong emphasis on practical, testable solutions that function outside the lab.

Areas of Expertise:

  • # Reinforcement Learning for Robotics
  • # Autonomous Agricultural Robots
  • # Embedded Systems & Edge AI (Jetson, Raspberry Pi, Arduino)
  • # Robotic Simulation & Sim2Real Workflow
  • # Sensor Fusion & Control Systems
  • # ROS-Based Robotics Development


© 2026 Reinforcement Learning Path
