This page was last edited on 15 May 2025
Dueling DQN is a variant of Deep Q-Networks.
Instead of learning just Q-values, it splits the Q-function into two parts:
- Value function V(s) – how good the state is
- Advantage function A(s, a) – how good a specific action is in that state
Then it combines them into Q(s, a).
Why Choose Dueling DQN?
In many states, the choice of action doesn’t matter much.
Dueling DQN helps by separately learning:
- How valuable the state is
- Which actions are better in that state
This can improve learning efficiency and policy quality, especially in states where many actions have similar values.
What Type of Learning Is It?
Value-based deep reinforcement learning. It estimates the Q-value for every action, using a shared network.
Model-Free or Model-Based?
Model-Free.
It learns directly from experience, without a model of the environment.
What Is It Trying to Compute?
It tries to compute the optimal Q-function, same as DQN, but represents it through the dueling decomposition:
\[ Q(s, a) = V(s) + A(s, a) - \text{mean}_a A(s, a) \]
Where:
- Q(s, a) – estimated value of taking action a in state s.
- V(s) – value of being in state s, regardless of which action is taken.
- A(s, a) – advantage of taking action a in state s; it tells how much better that action is compared to the average.
- meanₐ A(s, a) – the average advantage over all possible actions in state s, used to normalize the advantages.
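Subtracting mean(A) forces the advantages to average to zero in every state, so V(s) and A(s, a) cannot trade an arbitrary constant between them. This keeps the Q-values centered and stable.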
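The combination step can be checked by hand on a single state. This is a minimal sketch in plain Python; the function name `dueling_q_values` and the numbers are illustrative assumptions, not part of any library.

```python
def dueling_q_values(value, advantages):
    """Combine a scalar V(s) and a list of A(s, a) into Q(s, a).

    Hypothetical helper for illustration: subtracts the mean advantage
    so the resulting Q-values average to V(s).
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + (adv - mean_adv) for adv in advantages]


# One state with V(s) = 2.0 and three actions.
q = dueling_q_values(2.0, [1.0, 0.0, -1.0])

# Adding the same constant to every advantage changes nothing:
# the mean subtraction absorbs the shift.
q_shifted = dueling_q_values(2.0, [6.0, 5.0, 4.0])
```

Both calls give the same Q-values, and their average equals V(s), which is the effect the mean subtraction is meant to have.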
Training Loop
Same high-level flow as DQN / Double DQN:
- Collect experience (s, a, r, s', done)
- Store in replay buffer
- Sample mini-batch
- Compute Double DQN-style target
- Compute Q(s, a) using dueling architecture
- Compute loss between predicted and target Q
- Backpropagate and update
- Update target network (soft or hard copy)
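The "Double DQN-style target" step from the list above can be sketched as follows. This is a plain-Python illustration under stated assumptions: the function name, argument layout (one list of Q-values per sampled transition), and the discount `gamma` are hypothetical choices, and a real implementation would normally do this with batched tensors.

```python
def double_dqn_targets(rewards, dones, next_q_online, next_q_target, gamma=0.99):
    """Compute one target per transition in a mini-batch.

    next_q_online[i] / next_q_target[i] are the Q-values for s' from the
    online and target networks (both may be dueling networks).
    """
    targets = []
    for r, done, q_on, q_tg in zip(rewards, dones, next_q_online, next_q_target):
        if done:
            # Terminal transition: no bootstrapping from s'.
            targets.append(r)
            continue
        # Online network selects the action, target network evaluates it.
        best_action = max(range(len(q_on)), key=q_on.__getitem__)
        targets.append(r + gamma * q_tg[best_action])
    return targets


targets = double_dqn_targets(
    rewards=[1.0, 0.0],
    dones=[False, True],
    next_q_online=[[0.5, 2.0], [1.0, 3.0]],
    next_q_target=[[4.0, 1.0], [9.0, 9.0]],
    gamma=0.5,
)
```

In the first transition the online network prefers action 1, so the target uses the target network's value for action 1 (1.0), not its own maximum (4.0). That decoupling is what reduces overestimation.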
How to Implement Dueling DQN?
Instead of outputting Q-values directly, the network has two output heads:
- One for V(s) – a single scalar
- One for A(s, a) – a vector for all actions
Then combine:
Q = V + (A - A.mean(dim=1, keepdim=True))
You can use this in both DQN and Double DQN.
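A two-head network along these lines might look like the sketch below. It assumes PyTorch and a flat state vector; the class name `DuelingQNetwork`, the single shared hidden layer, and `hidden_dim=128` are illustrative assumptions, not a prescribed architecture (image inputs would typically use convolutional layers in the shared part).

```python
import torch
import torch.nn as nn


class DuelingQNetwork(nn.Module):
    """Shared feature layer followed by separate V and A heads."""

    def __init__(self, state_dim: int, num_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        self.value_head = nn.Linear(hidden_dim, 1)                # V(s)
        self.advantage_head = nn.Linear(hidden_dim, num_actions)  # A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.features(state)
        v = self.value_head(h)      # shape: (batch, 1)
        a = self.advantage_head(h)  # shape: (batch, num_actions)
        # Same combination as above; v broadcasts across the action dimension.
        return v + (a - a.mean(dim=1, keepdim=True))


net = DuelingQNetwork(state_dim=4, num_actions=3)
q_values = net(torch.zeros(5, 4))  # batch of 5 dummy states
```

The output has one Q-value per action, so the network can replace a standard DQN head without changing the rest of the training code.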
On-Policy or Off-Policy?
Off-Policy.
Learns from replayed past experiences, not just the current policy.
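A replay buffer is what makes the off-policy part concrete: transitions generated by older versions of the policy stay available for later updates. The sketch below uses only the standard library; the class name `ReplayBuffer` and its methods are hypothetical, and uniform sampling is an assumption (prioritized variants also exist).

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s', done) transitions."""

    def __init__(self, capacity):
        # deque with maxlen drops the oldest transition once full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(step, 0, 1.0, step + 1, False)
batch = buffer.sample(2)
```

With a capacity of 3 and 5 pushes, only the three most recent transitions remain, and the sampled batch is drawn from those.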
Exploration vs. Exploitation?
Uses standard ε-greedy exploration:
- With probability ε → random action
- Otherwise → action with max Q
ε is decayed over time.
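The exploration rule above fits in a few lines. This is a sketch with assumed details: the text does not specify a decay schedule, so the linear decay and its parameters (`start`, `end`, `decay_steps`) are illustrative choices, and both function names are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from start to end, then hold it at end."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


# With epsilon = 0 the choice is always greedy.
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

The Q-values passed in would come from the dueling network; the exploration logic itself is unchanged from plain DQN.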
When Does It Converge?
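- There is no formal convergence guarantee with neural network function approximation; in practice it trains reliably with stable updates (small learning rate, replay buffer, target network)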
- With enough exploration
- Works better when action effects are subtle but state value matters
Where Does It Struggle?
- In very small action spaces, the benefit is minimal
- Poor implementation of the architecture (e.g., wrong normalization)
- If A dominates V, training becomes unstable
What Problems Is It Good For?
- Large action spaces
- Visual or noisy inputs (Atari, image-based input)
- When state value is more important than the exact action choice
Common Traps and Mistakes
- Forgetting to normalize the advantage term → unstable Q-values
- Not using it with Double DQN → more overestimation
- Confusing architecture with Actor-Critic (this is still Q-learning)
Dueling DQN Equation
\[ Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \]
Where:
- Q(s, a) – estimated Q-value
- V(s) – value of being in state s
- A(s, a) – advantage of action a in state s
- mean_a A(s, a) – average advantage over all actions (to avoid redundancy)
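This makes V and A identifiable. Without the subtraction, adding a constant to V(s) and subtracting it from every A(s, a) would leave Q(s, a) unchanged, so the network could shift the two heads arbitrarily instead of learning something useful in each.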
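As a worked check (a short derivation added for illustration, using only the equation in this section), averaging Q(s, a) over all actions cancels the two advantage terms:

```latex
\begin{align*}
\frac{1}{|\mathcal{A}|} \sum_{a} Q(s, a)
  &= V(s) + \frac{1}{|\mathcal{A}|} \sum_{a} A(s, a)
     - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \\
  &= V(s)
\end{align*}
```

So under this parameterization the value head is pinned to the mean Q-value of the state, and the advantage head only encodes how each action differs from that mean.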