This page was last edited on 17 June 2025
Soft Actor-Critic (SAC) is a model-free, off-policy, actor-critic algorithm that maximizes reward while explicitly encouraging exploration. It adds an entropy term to the objective, which rewards the policy for keeping its behavior diverse. SAC is known for sample efficiency, stability, and strong performance in continuous control tasks.
Why Choose SAC?
- Stable training with entropy regularization
- Efficient: reuses data (off-policy)
- Handles continuous action spaces naturally
- Performs well even in high-dimensional environments
What Type of Learning Is It?
SAC uses reinforcement learning with stochastic policy gradients. The agent learns by interacting with the environment and maximizing expected return plus entropy.
Model-Free or Model-Based?
Model-Free.
SAC does not learn a transition model. It directly optimizes the policy and value functions using collected experience.
What Is It Trying to Compute?
SAC tries to maximize expected cumulative reward plus entropy:
\[
\pi^* = \arg\max_{\pi} \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\left( \pi(\cdot \mid s_t) \right) \right]
\]
Where:
- r(s_t, a_t): reward for taking action a_t in state s_t
- ℋ(π(· | s_t)): entropy of the policy at state s_t
- α: entropy coefficient (controls the trade-off between reward and entropy)
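As a rough illustration of the objective above, the sketch below adds an entropy bonus to a short list of rewards. It assumes a one-dimensional Gaussian policy (so the entropy has a closed form) and uses made-up rewards and standard deviations; the function names are illustrative and do not come from any library.

```python
import math

def gaussian_entropy(std: float) -> float:
    """Differential entropy of a 1-D Gaussian with standard deviation `std`."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)

def soft_return(rewards, entropies, alpha):
    """Sum of r_t + alpha * H_t over one trajectory (undiscounted, as in the objective)."""
    return sum(r + alpha * h for r, h in zip(rewards, entropies))

# Hypothetical trajectory: the policy becomes more deterministic over time.
rewards = [1.0, 0.5, 2.0]
stds = [1.0, 0.5, 0.1]
entropies = [gaussian_entropy(s) for s in stds]

print("plain return:", soft_return(rewards, entropies, alpha=0.0))
print("soft return: ", soft_return(rewards, entropies, alpha=0.2))
```

With α = 0 the entropy term vanishes and the objective reduces to the ordinary sum of rewards; a larger α gives more weight to keeping the policy random.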
Training Loop (Simplified)
- Sample action from current policy
- Execute action, observe reward and next state
- Store transition in replay buffer
- Sample a batch from the replay buffer and update:
  - Critic using Bellman backup
  - Actor via stochastic gradient
  - Entropy coefficient (tuned automatically or kept fixed)
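The steps above can be sketched as a bare control-flow skeleton. This is only a sketch under stated assumptions: `ToyEnv`, `sample_action`, and the three `update_*` functions are hypothetical placeholders (a real SAC agent would use neural networks and gradient steps there), and the buffer size and batch size are arbitrary.

```python
import random
from collections import deque

class ToyEnv:
    """Hypothetical 1-D environment: reward is highest when the action is near 0.5."""
    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        reward = -(action - 0.5) ** 2
        done = self.t >= 10          # episodes last 10 steps
        return float(self.t), reward, done

def sample_action(state, rng):
    # Placeholder stochastic policy; a real actor would condition on `state`.
    return rng.gauss(0.0, 1.0)

# Placeholders for the three updates; each would take a gradient step in real SAC.
def update_critic(batch): pass
def update_actor(batch): pass
def update_alpha(batch): pass

def train(num_steps=200, batch_size=32, seed=0):
    rng = random.Random(seed)
    env = ToyEnv()
    buffer = deque(maxlen=10_000)    # replay buffer of transitions
    updates = 0
    state = env.reset()
    for _ in range(num_steps):
        action = sample_action(state, rng)                        # 1. sample action
        next_state, reward, done = env.step(action)               # 2. act, observe
        buffer.append((state, action, reward, next_state, done))  # 3. store
        state = env.reset() if done else next_state
        if len(buffer) >= batch_size:                             # 4. sample and update
            batch = rng.sample(list(buffer), batch_size)
            update_critic(batch)
            update_actor(batch)
            update_alpha(batch)
            updates += 1
    return len(buffer), updates

print(train())
```

Because the batch is drawn from the whole buffer rather than from the latest episode, each stored transition can contribute to many updates.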
On-Policy or Off-Policy?
Off-Policy
It uses a replay buffer and can learn from past data, improving sample efficiency.
Exploration vs. Exploitation?
SAC explicitly encourages exploration through entropy maximization. The agent learns a stochastic policy that maintains randomness unless more deterministic behavior is more rewarding.
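To make the idea of a stochastic policy concrete, here is a minimal sketch of sampling from a tanh-squashed Gaussian, which is a common choice in SAC implementations for bounded actions. The squashing and the small `1e-6` constant are assumptions not stated in this article, and `sample_squashed_gaussian` is an illustrative name.

```python
import math
import random

def sample_squashed_gaussian(mean, log_std, rng):
    """Sample a in (-1, 1) and return (a, log pi(a)) for a tanh-squashed Gaussian."""
    std = math.exp(log_std)
    noise = rng.gauss(0.0, 1.0)
    u = mean + std * noise                 # reparameterized Gaussian sample
    action = math.tanh(u)                  # squash into (-1, 1)
    # Log-density of u under N(mean, std^2).
    log_prob = -0.5 * noise ** 2 - log_std - 0.5 * math.log(2.0 * math.pi)
    # Change-of-variables correction for the tanh squashing.
    log_prob -= math.log(1.0 - action ** 2 + 1e-6)
    return action, log_prob

rng = random.Random(0)
for log_std in (-2.0, 0.0):
    actions = [sample_squashed_gaussian(0.3, log_std, rng)[0] for _ in range(1000)]
    spread = max(actions) - min(actions)
    print(f"log_std={log_std:+.1f}  action spread={spread:.3f}")
```

A larger `log_std` spreads the sampled actions more widely (more exploration), while a very small one makes the policy nearly deterministic.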
When Does It Converge?
- Typically needs fewer environment interactions than on-policy algorithms (e.g., PPO), because it reuses past data
- In practice, training is treated as converged when:
  - Critic loss stabilizes
  - Policy entropy settles
  - Reward plateaus across episodes
Where Does It Struggle?
- Environments with sparse rewards
- Discrete action spaces (standard SAC assumes continuous actions)
- Entropy tuning can destabilize training if not handled properly
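For the entropy-tuning point, the sketch below shows one gradient step of automatic α adjustment as it is commonly implemented: α is pushed up when the policy's entropy estimate falls below a target and down when it exceeds it. The loss form, the `target_entropy` parameter, the learning rate, and the name `alpha_step` are assumptions for illustration and are not given in this article; a frequently used heuristic sets the target to the negative of the action dimension.

```python
import math

def alpha_step(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on J(alpha) = mean(-alpha * (log_pi + target_entropy)).

    alpha is parameterized as exp(log_alpha) so that it stays positive.
    """
    alpha = math.exp(log_alpha)
    mean_gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -alpha * mean_gap          # dJ / d(log_alpha)
    return log_alpha - lr * grad

log_alpha = 0.0
# The entropy estimate is -mean(log_probs) = 0.5, below the target of 1.0,
# so alpha should increase to reward more randomness.
log_alpha = alpha_step(log_alpha, log_probs=[-0.5, -0.5], target_entropy=1.0)
print(math.exp(log_alpha))
```

A small learning rate for α is one way to reduce the risk of the instability mentioned above, since abrupt changes in α shift the critic's targets.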
What Problems Is It Good For?
- Robotic control (UR5, grippers, drones)
- Continuous action tasks (e.g., MuJoCo, Isaac Sim)
- Complex environments where exploration is critical
Common Traps and Mistakes
- Forgetting to tune or anneal the entropy coefficient α
- Using too small a replay buffer
- Skipping soft (smoothed) target-network updates for the critic
- Forgetting to clip gradients or otherwise stabilize losses
SAC Equation and Parameters
Policy Objective (Entropy Regularized):
\[
J_{\pi} = \mathbb{E}_{s_t \sim \mathcal{D}} \left[ \mathbb{E}_{a_t \sim \pi_\theta} \left[ \alpha \log \pi_\theta(a_t \mid s_t) - Q_\phi(s_t, a_t) \right] \right]
\]
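As a small numeric illustration of the policy objective, the sketch below computes a Monte Carlo estimate of it from a batch of log-probabilities and critic values. The numbers are made up and `actor_loss` is an illustrative name; in a real implementation these quantities would come from the actor and critic networks, and the loss would be minimized by gradient descent on θ.

```python
def actor_loss(log_probs, q_values, alpha):
    """Batch estimate of E[alpha * log pi(a|s) - Q(s, a)]."""
    terms = [alpha * lp - q for lp, q in zip(log_probs, q_values)]
    return sum(terms) / len(terms)

# Hypothetical batch of two (state, sampled action) pairs.
log_probs = [-1.0, -2.0]   # log pi_theta(a_t | s_t)
q_values = [3.0, 1.0]      # Q_phi(s_t, a_t)

print(actor_loss(log_probs, q_values, alpha=0.5))
```

Minimizing this quantity pushes the policy toward actions with high Q-values while the `alpha * log_prob` term penalizes overly confident (low-entropy) action choices.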
Critic Loss:
\[
J_Q = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}} \left[ \left( Q_\phi(s_t, a_t) - y_t \right)^2 \right]
\]
where:
\[
y_t = r_t + \gamma \, \mathbb{E}_{a_{t+1} \sim \pi_\theta} \left[ Q_{\bar{\phi}}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\theta(a_{t+1} \mid s_{t+1}) \right]
\]
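The target and the critic loss can be checked by hand on a single transition. The sketch below uses one sampled next action in place of the expectation; the `done` flag (which drops the bootstrap term at episode end) is a common implementation detail assumed here, not part of the equation above, and the function names are illustrative.

```python
def soft_td_target(reward, next_q_target, next_log_prob, alpha, gamma, done=False):
    """y = r + gamma * (Q_target(s', a') - alpha * log pi(a'|s')), with a' ~ pi."""
    soft_value = next_q_target - alpha * next_log_prob
    not_done = 0.0 if done else 1.0   # assumption: no bootstrapping past terminal states
    return reward + gamma * not_done * soft_value

def critic_loss(q_values, targets):
    """Mean squared error between Q(s, a) and the targets y."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)

# Hypothetical numbers for one transition.
y = soft_td_target(reward=1.0, next_q_target=2.0, next_log_prob=-1.0,
                   alpha=0.2, gamma=0.99)
print(y)
print(critic_loss([3.0], [y]))
```

Note that the entropy bonus enters through `- alpha * next_log_prob`: a less likely next action (more negative log-probability) raises the target.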
Parameters Explained:
- α: Entropy coefficient – controls exploration-exploitation trade-off
- π_θ: Stochastic policy network (actor)
- Q_ϕ: Critic network estimating value of state-action pairs
- γ: Discount factor (future reward weight)
- D: Replay buffer
- ϕ̄: Target network parameters (soft-updated from ϕ)
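The soft update of the target parameters is usually an exponential moving average of the critic parameters. A minimal sketch follows, treating parameters as plain lists of floats; the mixing rate `tau` and its default value are assumptions, since this article does not name or fix it.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Return target <- (1 - tau) * target + tau * online, element by element."""
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, online_params)]

target = [0.0, 0.0]
online = [1.0, -2.0]
for _ in range(3):
    target = soft_update(target, online, tau=0.1)
print(target)
```

A small `tau` makes the target network trail the critic slowly, which keeps the Bellman targets from shifting abruptly between updates.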
References:
- Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
- Achiam, J. (2018). SAC Explained. OpenAI Spinning Up.