This page was last edited on 13 November 2025
How do we solve a real-world problem when the state space is huge (such as robotics, autonomous driving, video games), or even continuous (positions, speeds, images, etc.)?
How do you understand what is in an image?
Deep Q Network (DQN) is an algorithm that allows the agent to learn optimal behavior even when the states cannot be explicitly enumerated.
DQN builds on classic (tabular) Q-learning, an algorithm that works well only when the number of possible states is small. A table of Q-values cannot store entries for millions or billions of states. DQN addresses this:
- it uses a neural network that learns to approximate the function Q(s, a),
- thus, it no longer needs a huge table.
Neural networks can process raw input data without humans manually engineering features from it.
Examples:
- If the input is an image, DQN processes it directly (but it uses a Convolutional Neural Network (CNN) to automatically extract visual features),
- If it is a robot, it can learn directly from sensors or IMU data,
- If it is an agent in a simulator, it can receive numerical telemetry (position, velocity, acceleration).
So DQN learns directly from raw observations, without anyone manually telling it which indicators to track.
If you want to see how to implement this algorithm using Gymnasium, PyTorch, and Stable Baselines3, check the full step-by-step tutorial here: Deep Q-Learning Explained – A Step-by-Step Guide to Build, Train, and Visualize Your First DQN Agent with PyTorch, Gymnasium, and Stable Baselines3
The origins of DQN
DeepMind (the British company founded by Demis Hassabis, later bought by Google) introduced DQN in 2013. In 2015, the famous article “Human-level control through deep reinforcement learning” was published, which became a turning point in AI.
In that study:
- the DQN agent learned to play Atari games (such as Breakout, Pong, Space Invaders),
- the agent learned just by looking at the pixels on the screen, without any information about the rules of the game,
- it learned optimal behaviors through trial and error, just as a human would.
This result was a milestone in RL. It demonstrated that an agent could learn from raw visual perception alone, without game-specific programming.
What type of learning is it?
DQN is a value-based method. It does not learn a policy directly. Instead, it estimates how good each possible action is in a given state, and the action is then chosen from those estimates.
Is DQN Model-Free or Model-Based?
DQN is a Model-Free Reinforcement Learning algorithm.
This means it doesn’t try to model the environment’s dynamics — no transition function P(s'|s,a) is learned or used.
Instead, the agent learns directly from experience tuples (state, action, reward, next state), using a neural network to approximate Q-values.
This is ideal when:
- The environment is too complex or noisy to model accurately,
- You prefer real-world trial-and-error learning,
- You want to avoid handcrafted dynamics or simulators.
What is the algorithm trying to compute or optimize?
DQN learns the optimal Q-function -> the expected discounted sum of future rewards for taking action a in state s and acting optimally afterward.
It minimizes the difference between the predicted Q-value and the target Q-value.
What is the training loop?
The main loop of DQN looks like this:
- Predict: Estimate Q-values from current state,
- Action: Choose action via epsilon-greedy,
- Feedback: Apply action, observe reward and next state,
- Store: Save transition in replay buffer,
- Sample: Draw a batch of past transitions,
- Compute: Use Bellman equation to calculate target Q,
- Update: Minimize the Temporal Difference (TD) error using gradient descent,
- Sync: Occasionally update the target network.
The training loop is repeated over thousands of episodes.
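The Store and Sample steps of the loop can be sketched with a small replay buffer. This is a minimal illustration, not a reference implementation: the names `Transition` and `ReplayBuffer`, the `done` flag, and the capacity and batch size are assumptions made for the example.

```python
import random
from collections import deque, namedtuple

# One experience tuple: (state, action, reward, next state).
# The done flag is an addition of this sketch, used to mark episode ends.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-capacity memory; the oldest transitions are dropped first."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement reduces the correlation
        # between consecutive steps of the same episode.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Hypothetical usage: capacity 3, so only the last three of five
# stored transitions remain available for sampling.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.store(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

batch = buffer.sample(batch_size=2)
```

In practice the capacity is usually much larger, so that a batch mixes transitions from many different episodes.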
How do you implement DQN?
- Define state and action spaces
- Create Q-network and target network
- Initialize replay buffer and epsilon parameters
- Write reward function for your specific task
- Use torch.optim.Adam or similar for training
- Track performance over episodes
- Save model weights when performance is good
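Several items from this checklist can be sketched as a single PyTorch training step. The layer sizes, learning rate, discount factor, and batch size below are illustrative assumptions, the random tensors stand in for a batch sampled from a replay buffer, and the `dones` mask for terminal transitions is an addition of this sketch.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99  # illustrative values

# Online Q-network: maps a state vector to one Q-value per discrete action.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS)
)
# Target network: a copy that is only refreshed by the sync step below.
target_net = copy.deepcopy(q_net)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Random data standing in for transitions sampled from a replay buffer.
batch_size = 32
states = torch.randn(batch_size, STATE_DIM)
actions = torch.randint(0, N_ACTIONS, (batch_size,))
rewards = torch.randn(batch_size)
next_states = torch.randn(batch_size, STATE_DIM)
dones = torch.zeros(batch_size)  # 1.0 would mark a terminal transition

# Q-values of the online network for the actions that were actually taken.
q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

# Bellman target from the target network; no gradient flows through it.
with torch.no_grad():
    max_next_q = target_net(next_states).max(dim=1).values
    targets = rewards + GAMMA * (1.0 - dones) * max_next_q

# Mean-squared TD error, followed by one gradient step on the online network.
loss = nn.functional.mse_loss(q_pred, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Sync step: copy the online weights into the target network.
target_net.load_state_dict(q_net.state_dict())
```

A full agent would run the sync step only every few thousand updates rather than after every step; it appears here once to show the call.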
Is DQN on-policy or off-policy?
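DQN is off-policy. It learns the value of the greedy (best) policy regardless of the actions the agent actually took during training, which is also what allows it to reuse old transitions from the replay buffer.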
How does the agent balance exploration vs. exploitation?
DQN uses an epsilon-greedy strategy:
- With probability ε, take a random action (exploration),
- Otherwise, choose the action with the highest Q-value (exploitation)
ε decays over time as the agent gains more experience.
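A minimal sketch of epsilon-greedy action selection follows. The text only says that ε decays over time, so the linear schedule, the function names, and the default values (`eps_start`, `eps_end`, `decay_steps`) are assumptions chosen for illustration.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice over a list of Q-values, one per action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # Exploit: index of the largest Q-value (the first one wins on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linear decay from eps_start to eps_end, then constant at eps_end."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


q_values = [0.1, 0.7, 0.3]
greedy_action = select_action(q_values, epsilon=0.0)  # never explores
epsilon_midway = decayed_epsilon(step=5_000)          # halfway through the decay
```

Keeping a small final ε instead of decaying to zero is a common choice, so the agent never stops exploring entirely.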
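Under what conditions does DQN converge?
With neural network function approximation, DQN has no formal convergence guarantee. In practice, training tends to converge when: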
- All state-action pairs are explored,
- The learning rate decays properly,
- The replay buffer is diverse,
- The environment is stationary,
- The target network is updated regularly.
Where does it struggle?
- Continuous action spaces,
- Sparse or delayed rewards,
- Environments with partial observability,
- Fast-changing environments,
- Real-time control with strict timing.
What kind of problems is it good for?
DQN works well in environments with discrete action spaces and clear rewards.
Applications across industries:
Games
- Atari games,
- Gridworlds,
- Board games with discrete moves.
Robotics
- Discrete motor control,
- Obstacle avoidance with LIDAR,
- Line following,
- Robotic arm with discrete joint movements.
Autonomous Vehicles
- Path planning with grid maps,
- Lane switching,
- Discrete throttle/brake/steering options.
Agriculture
- Robot navigation between rows,
- Positioning for planting or harvesting,
- Task scheduling with limited action options.
Manufacturing / Automation
- Task selection,
- Discrete machine operations,
- Assembly sequences.
Telecom & Networks
- Resource allocation,
- Discrete packet routing,
- Power level selection.
Smart Energy
- HVAC control (on/off or fixed levels),
- Grid power balancing,
- Load shedding.
Finance
- Portfolio rebalancing (discrete actions),
- Order execution timing.
Common traps and mistakes
- No target network → leads to unstable learning,
- Replay buffer too small or lacks diversity,
- Epsilon doesn’t decay → agent stays random,
- Wrong reward design → agent learns unintended behavior,
- No normalization → gradients explode,
- Too few training episodes → underfitting.
DQN Equation
In the specialized literature, DQN does not have a single equation. It is defined by two fundamental relations:
- Bellman Equation (target value / target)
- Loss Equation (loss function)
1. Bellman Optimality Target
\[ y_t = r_t + \gamma \, \max_{a'} Q(s_{t+1}, a'; \theta^{-}) \]
Where:
- y_t: Target Q-value -> The value that the online Q-network is trained to predict. It represents the “ideal” Q-value for the current state and action.
- r_t: Immediate reward -> The reward received after taking action a_t in state s_t. It measures how good that specific action was in the short term.
- γ: Discount factor -> A constant between 0 and 1 that determines how much future rewards matter.
- max_{a′}: Maximum over next actions -> It selects the best possible action a′ in the next state s_{t+1}, based on the highest predicted Q-value.
- Q(s_{t+1}, a′; θ⁻): Target network’s Q-value estimate -> The Q-value predicted by the target network for the next state and action.
- s_{t+1}: Next state -> The state observed after executing the action a_t from the current state s_t.
- a′: Next possible action -> Any valid action the agent could take in the next state s_{t+1}.
- θ⁻: Parameters of the target network -> The weights of the target network, a copy of the online Q-network’s weights that is updated only occasionally.
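The target can be worked through with concrete numbers. The sketch below uses hypothetical values, and the `done` flag is an assumption that does not appear in the equation: it covers the common convention that a terminal transition has no future reward.

```python
def bellman_target(reward, next_q_values, gamma=0.99, done=False):
    """y_t = r_t + gamma * max over a' of Q(s_{t+1}, a'; theta^-).

    next_q_values: the target network's Q-values for every action in s_{t+1}.
    done: assumption of this sketch; a terminal transition has no future
          reward, so the target reduces to the immediate reward.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical numbers: reward 1.0 and target-network estimates for 3 actions.
y = bellman_target(reward=1.0, next_q_values=[0.5, 2.0, 1.2], gamma=0.9)
# 1.0 + 0.9 * 2.0 = 2.8

y_terminal = bellman_target(reward=1.0, next_q_values=[0.5, 2.0, 1.2], done=True)
```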
2. Loss Function (Mean-Squared Temporal Difference Error)
\[ \mathcal{L}(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}} \Bigl[ \bigl( y_t - Q(s_t, a_t; \theta) \bigr)^2 \Bigr] \]
Where:
- L(θ): Loss function -> Measures how far the Q-network’s predictions are from the target values y_t.
- E[⋅]: Expectation operator -> The average error over many sampled transitions.
- (s_t, a_t, r_t, s_{t+1}) ∼ D: Experience sampling from replay buffer -> The transitions (state, action, reward, next state), sampled randomly from the replay buffer D, which stores past experiences.
- D: Replay buffer (Experience Replay) -> A memory that stores previous interactions (s, a, r, s′).
- y_t: Bellman target -> The target Q-value.
- Q(s_t, a_t; θ): Predicted Q-value -> The current Q-value predicted by the online network with parameters θ, for the chosen action a_t in state s_t.
- (y_t − Q(s_t, a_t; θ))^2: Squared Temporal Difference (TD) Error -> Measures how far the predicted Q-value is from the Bellman target.
- θ: Network parameters -> The weights of the online Q-network.
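In practice the expectation is approximated by averaging over a sampled mini-batch. A minimal sketch with hypothetical numbers, where `td_loss` is a name chosen for this example:

```python
def td_loss(predicted_q, targets):
    """Mean of the squared TD errors (y_t - Q(s_t, a_t; theta))^2 over a batch."""
    squared_errors = [(y - q) ** 2 for y, q in zip(targets, predicted_q)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical mini-batch of three transitions.
predicted_q = [1.0, 2.5, 0.0]  # Q(s_t, a_t; theta) from the online network
targets = [2.0, 2.5, -1.0]     # y_t computed with the target network

loss = td_loss(predicted_q, targets)
# Squared errors are 1.0, 0.0 and 1.0, so the mean is 2/3.
```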
3. The “expanded” version of the loss, with the Bellman target substituted in; up to notation, this is the form used in the DeepMind papers
\[ \mathcal{L}(\theta) = \mathbb{E} \Bigl[ \bigl( r_t + \gamma \, \max_{a'} Q(s_{t+1}, a'; \theta^{-}) - Q(s_t, a_t; \theta) \bigr)^2 \Bigr] \]
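Because the target is computed with the target network’s parameters θ⁻, it is treated as a constant when differentiating with respect to θ. Under that assumption, the gradient used in the update step can be sketched as:

\[ \nabla_{\theta} \mathcal{L}(\theta) = -2 \, \mathbb{E} \Bigl[ \bigl( r_t + \gamma \, \max_{a'} Q(s_{t+1}, a'; \theta^{-}) - Q(s_t, a_t; \theta) \bigr) \, \nabla_{\theta} Q(s_t, a_t; \theta) \Bigr] \]

The constant factor 2 is usually absorbed into the learning rate.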
Further reading
References:
- Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
- Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
- OpenAI Gym documentation.