This page was last edited on 14 October 2025
Q-learning solves decision-making problems in environments where an agent must learn what to do by trial and error. It teaches the agent how to act optimally over time.
Q stands for the “quality” of an action taken in a state.
Q-learning was one of the first model-free Reinforcement Learning (RL) methods. It laid the foundation for Deep Q-Learning, later used in Atari and robotics.
The Q-Learning equation was not developed all at once. It was gradually built on mathematical and algorithmic foundations from decision theory and reinforcement learning.
Its update formula rests on three ideas. The Markov Decision Process (MDP) provides the general framework, the Bellman equation provides the recursion for optimal values, and Temporal Difference (TD) learning provides the model-free update mechanism. Historically, it all starts with MDPs and Bellman in the 1950s. TD learning followed in the 1980s, and Q-Learning appeared in 1989.
What type of learning is Q-Learning?
Q-learning is value-based reinforcement learning.
It focuses on learning the value of actions, not the policy directly.
What is the algorithm type?
Q-learning is off-policy. It learns the optimal policy independently of the actions actually taken during learning.
How does the agent balance exploration vs exploitation?
Using an epsilon-greedy strategy:
- With probability ε → explore (random action)
- With probability 1−ε → exploit (choose the best known action)
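The two branches above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `epsilon_greedy` and the choice to represent one state's Q-values as a plain list are assumptions made for the example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index given the Q-values of a single state."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimate (first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])

action = epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.1)
```

With ε = 0 the function always returns the greedy action, and with ε = 1 it always picks at random.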
Does the algorithm converge? Under what conditions?
Q-learning converges if:
- All state-action pairs are visited infinitely often
- The learning rate decays appropriately (the step sizes sum to infinity, but their squares sum to a finite value)
- Environment is stationary
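One schedule that is often used to meet the learning-rate condition is α = 1/n, where n counts how many times a given state-action pair has been updated. The sketch below assumes a hypothetical `visit_counts` dictionary and a helper named `learning_rate`; neither comes from a particular library.

```python
from collections import defaultdict

# Hypothetical counter: how often each (state, action) pair has been updated.
visit_counts = defaultdict(int)

def learning_rate(state, action):
    """Return alpha = 1 / n for the n-th update of (state, action)."""
    visit_counts[(state, action)] += 1
    return 1.0 / visit_counts[(state, action)]

first = learning_rate(0, 1)    # first visit of this pair
second = learning_rate(0, 1)   # second visit of the same pair
```

Each pair keeps its own counter, so rarely visited pairs still learn with a large step while frequently visited ones settle down.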
Where does it struggle?
It struggles with:
- Large or continuous state spaces (Q-table becomes huge)
- No generalization between similar states. That is where Deep Q-Learning comes in.
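To get a feel for how fast a Q-table grows, the back-of-the-envelope calculation below counts entries when each continuous state variable is cut into bins. The helper name `q_table_entries` and the example numbers (four state variables, two actions) are illustrative assumptions, not measurements from a specific environment.

```python
def q_table_entries(bins_per_dimension, n_dimensions, n_actions):
    """Number of Q-values stored: one per (discretized state, action) pair."""
    return bins_per_dimension ** n_dimensions * n_actions

coarse = q_table_entries(bins_per_dimension=10, n_dimensions=4, n_actions=2)
fine = q_table_entries(bins_per_dimension=100, n_dimensions=4, n_actions=2)
print(coarse, fine)
```

Going from 10 to 100 bins per variable multiplies the table size by 10,000 in this setup, and every one of those entries has to be visited to be learned.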
What kind of problems is it good for?
Q-learning works well for:
- Discrete environments
- Grid worlds
- Simple robotics
- Control tasks with a limited number of actions
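As an illustration of the kind of small, discrete problem listed above, here is a sketch of tabular Q-learning on a toy five-cell corridor. Everything about the environment is an assumption made for the example: the agent starts in cell 0, the goal is cell 4, the only reward is 1 for reaching the goal, and the hyperparameter values are arbitrary.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the terminal goal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.3

def step(state, action):
    """Deterministic toy environment: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def choose_action(q_row, rng):
    """Epsilon-greedy selection with random tie-breaking."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q_row)
    return rng.choice([a for a in ACTIONS if q_row[a] == best])

def train(episodes=500, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = choose_action(q[state], rng)
            next_state, reward, done = step(state, action)
            # A terminal state has no future value.
            future = 0.0 if done else max(q[next_state])
            q[state][action] += ALPHA * (reward + GAMMA * future - q[state][action])
            state = next_state
            if done:
                break
    return q

q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

With these settings the learned greedy policy should move right in every non-terminal cell, and the Q-values should shrink by roughly a factor of γ for each extra step away from the goal.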
What other algorithms build on top of it?
- Deep Q-Learning (DQN)
- Double Q-Learning
- Dueling DQN
- Prioritized Experience Replay
- Rainbow DQN
These add stability, performance, and scalability.
Q-Learning Formula and Explanation
$$Q(s, a) \leftarrow Q(s, a) + \alpha \cdot \left[ r + \gamma \cdot \max_{a'} Q(s', a') - Q(s, a) \right]$$
Where:
- Q(s,a): current Q value for the state-action pair
- α: learning rate
- γ: discount factor
- r: reward
- s′: next state
- a′: a candidate action in the next state s′
- max_{a′} Q(s′,a′): the maximum estimated value for the next state, according to the “greedy” principle of Q-Learning
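In code, one application of the update might look like the sketch below. The Q-table is assumed to be a list of lists indexed as `q[state][action]`, and the function name `q_update` and the `done` flag (used to drop the future term at a terminal state) are choices made for this example.

```python
def q_update(q, s, a, r, s_next, alpha, gamma, done=False):
    """Apply one Q-learning update to q[s][a] in place; return the TD error."""
    best_next = 0.0 if done else max(q[s_next])   # max over a' of Q(s', a')
    td_target = r + gamma * best_next             # r + gamma * max Q(s', a')
    td_error = td_target - q[s][a]                # bracketed term in the formula
    q[s][a] += alpha * td_error
    return td_error

q = [[0.0, 0.0], [1.0, 4.0]]   # two states, two actions
q_update(q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.5)
print(q[0][1])   # 1.5
```

Here the target is 1 + 0.5 · 4 = 3, the old value is 0, and half of the gap (α = 0.5) is added, giving 1.5.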
The update rule above is the core of Q-learning. Below is an explanation of each term in the equation.
1. Q(s,a): current Q value for the state-action pair
- it is the current estimate of “how good” action a is in state s, based on previous experiences. It is both the input (old value) and the output (updated value) of the formula.
- it stores and accumulates knowledge about the environment. Without a persistent estimate, the agent would forget everything after each experience. Just like an amnesiac.
- if Q(s, a) starts too high (overestimated), updates pull it down towards the true value. If it starts too low, updates raise it.
- it directly influences the policy. The agent will choose actions with maximum Q, so a wrong estimate leads to temporary bad decisions.
Analogy: Think of a travel diary where you record the average score of a restaurant (Q). At each visit, you update the score based on the new experience. At the same time, you take into account the old score that influences how much the past “weighs”.
2. α: learning rate
- it is a scaling factor (between 0 and 1) that decides how much new experience influences the old estimate.
- it has the role of balancing between exploiting old knowledge (stability) and adapting to new data (flexibility). Without it, a single experience would rewrite everything, or nothing would change.
- this parameter influences the result as follows: a large α (e.g. 0.9) makes learning fast but unstable (estimates oscillate and may settle slowly on the optimum). A small α (e.g. 0.1) makes learning slow but stable, averaging out noise. Decreasing α over time helps the final convergence.
Analogy: Like the volume of a radio: a large α means that a new song (experience) completely covers the old one, changing the playlist abruptly. A small α adds it subtly, keeping the balance.
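The effect of α can be seen by moving one estimate towards one target with different step sizes. The numbers below (an old estimate of 10 and a target of 20) and the helper name `blend` are made up for illustration.

```python
def blend(old_q, td_target, alpha):
    """Move old_q towards td_target by a fraction alpha of the gap."""
    return old_q + alpha * (td_target - old_q)

old_q, td_target = 10.0, 20.0
for alpha in (0.0, 0.1, 0.9, 1.0):
    print(alpha, blend(old_q, td_target, alpha))
```

At α = 0 nothing changes, at α = 1 the new experience overwrites the old estimate completely, and values in between mix the two.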
3. γ: discount factor
- it is a number between 0 and 1 that weights the importance of future rewards in the target.
- it serves to prioritize immediate rewards over distant ones, modeling uncertainty or time costs. Without discounting (γ = 1), the sum of rewards over an infinite horizon can diverge.
- if γ is close to 1 (e.g. 0.99) it makes the agent visionary, optimizing in the long term (good for complex games). A small γ (e.g. 0.1) makes it short-sighted, focused on quick wins, but risking long-term pitfalls. It affects the decision “horizon”.
Analogy: Like an investment: γ=1 means that a dollar received in 10 years is worth the same as a dollar today (zero risk); γ=0.5 per year makes it worth only 0.5^10 ≈ 0.001 dollars, pushing you towards quick returns.
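The same arithmetic can be checked with a short function that computes a discounted sum of rewards. The name `discounted_return` and the reward sequence (a single reward of 1 arriving after ten steps) are assumptions chosen to mirror the analogy.

```python
def discounted_return(rewards, gamma):
    """Sum rewards[t] * gamma**t over a finite reward sequence."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

rewards = [0.0] * 10 + [1.0]   # nothing for ten steps, then a reward of 1
print(discounted_return(rewards, 1.0))   # 1.0
print(discounted_return(rewards, 0.5))   # 0.0009765625
```

With γ = 1 the delayed reward keeps its full value, while with γ = 0.5 it is worth about a thousandth of that.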
4. r: reward
- it is the direct feedback from the environment for the action taken. It is a scalar number (positive for good, negative for bad, zero for neutral).
- this parameter has the role of guiding the agent towards useful actions. Without reward, the agent would not know what is “good” or “bad” in the environment. It is like a child learning without praise or punishment.
- r influences the outcome as follows. A large r rapidly increases Q(s,a), making the action attractive and accelerating learning towards immediate goals. A small or negative r discourages it. It also shapes the learning trajectory: sparse rewards (e.g. only at the end of an episode) make learning slow, because the signal has to propagate back through many updates.
Analogy: Like a monthly salary in a career: a large bonus (r = +100) motivates you to repeat the job, but a pay cut (r = −10) makes you look for something else, adjusting the estimated “value” of your career.
5. max_{a′} Q(s′,a′): the maximum estimated value for the next state
- its role is to choose the best anticipated option in the future state s′. This is computed as the maximum over all possible actions a′. It is the “future” part of the update target.
- it exists to incorporate long-term planning. Without it, the agent would only learn from immediate rewards, ignoring later consequences (myopia).
- it influences the outcome because it makes Q reflect not only the present, but also an optimistic view of the future, leading to smarter policies. If future estimates are optimistic (large Q), the agent learns aggressively. If they are pessimistic, it becomes conservative. This term is what makes Q-learning off-policy: the target always uses the best action in s′, whatever action the agent actually takes next, so the agent can learn an optimal policy without strictly following it.
Analogy: Like a GPS that shows not just the immediate distance, but the fastest overall route: max Q is the “minimum estimated time” remaining from the next intersection, guiding the current decision.
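A short sketch makes the role of the max term concrete. The helper name `td_target` and the example Q-values are assumptions for illustration; the `done` flag simply drops the future term at a terminal state.

```python
def td_target(reward, next_q_values, gamma, done=False):
    """Update target: immediate reward plus discounted best next-state value."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

next_q_values = [0.0, 5.0, 2.0]   # Q(s', a') for three candidate actions a'
with_future = td_target(0.0, next_q_values, gamma=0.9)
myopic = td_target(0.0, next_q_values, gamma=0.0)
print(with_future, myopic)
```

Even with an immediate reward of zero, the target is positive once the next state looks promising, whereas setting γ = 0 removes the future term and leaves the agent myopic. The action the agent actually takes in s′ does not appear anywhere in the target.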
Further reading
If you want to understand better how the Q-values are updated in Q-Learning, follow this tutorial: Step-by-Step Tutorial: Q-Learning Example with CartPole.
References:
- Watkins, C. J. C. H., & Dayan, P. (1992). Q-Learning. Machine Learning. Springer.
- Chadi, M. A., & Mousannif, H. (2023). Understanding Reinforcement Learning Algorithms: The Progress from Basic Q-Learning to Proximal Policy Optimization.
- Simplilearn. (2025). Q-Learning Explained: Learn Reinforcement Learning Basics.