In Reinforcement Learning (RL), choosing the right value for the discount factor γ is one of the most underestimated decisions. This parameter appears in the Bellman equation and controls the balance between immediate and future rewards. The wrong choice can lead to slow learning, lack of convergence, or unexpected behaviors, even if the rest of the algorithm is identical.
The main goal of this tutorial is to clarify the role and impact of the discount factor γ (gamma) in Reinforcement Learning (RL). Using a practical case study with Q-Learning and CartPole, I explore, through experimental and visual demonstrations, how different values of γ can completely transform the agent’s behavior: from short-sighted and unstable, to planning for the long term, to outright divergent.
In this tutorial you will learn:
- how γ affects convergence and stability in Q-Learning,
- how to choose the right value for your own RL environment,
- and what happens when γ exceeds the recommended limits (for example, γ > 1.0) and why the algorithm crashes.
To further understand the role of this factor in the Bellman and Q-Learning equations, I recommend reading these two articles:
Table of Contents
- The Origins and Meaning of the Discount Factor (γ)
- What is the Discount Factor?
- Where the Discount Factor Appears in the Q-Learning Update Rule
- Environment: CartPole + Q-Learning
- Step-by-Step Example: How the Discount Factor Affects Q-Learning
- Comparative Analysis: How Different γ Values Affect Learning and Convergence
- Practical Insights, Pitfalls & Debugging
- Conclusion
- Download & Reproduce the Experiments (GitHub Link)
The Origins and Meaning of the Discount Factor (γ)
The concept of discounting has its roots in economics. It comes from time-preference theory, developed in the 19th and early 20th centuries by economists such as Irving Fisher. It models the idea that future rewards are worth less than immediate ones. Discounting also prevents the divergence of infinite sums in long-term or infinite-horizon models.
In the 1950s, the discount factor was adopted into optimal control and dynamic programming through the Bellman equation, which decomposes a problem into recursive subproblems. When γ < 1 and rewards are bounded, the values defined by this recursion are finite. In infinite-horizon decision processes (infinite-horizon MDPs), without discounting the sum of rewards can diverge, making the optimization undefined.
For robotics — for example, autonomous navigation in large environments such as orchards or fields — this factor helps us decide how much the future matters when the agent plans its actions.
What is the Discount Factor?
The discount factor determines how much an agent values future rewards compared to immediate ones. It is a number between 0 and 1 that defines the agent’s time horizon, that is, how far into the future the agent looks when making decisions.
Formally, it appears in the return equation:
\[ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \]
Where:
- G_t: total discounted return,
- R_{t+1}, R_{t+2}, R_{t+3}, …: future rewards,
- γ: discount factor,
- γ^2, γ^3, …: increasing powers of γ.
This equation says that each future reward is multiplied by γ raised to the number of steps ahead, so distant rewards contribute less to the total value.
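To make the weighting concrete, here is a minimal sketch that evaluates the return equation for a short reward sequence. The function name `discounted_return` and the five-step reward list are illustrative assumptions, not part of the original experiments.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    total = 0.0
    # Walk the rewards backwards so each earlier step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Illustrative data: five steps with a reward of +1 each.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
for gamma in (0.0, 0.5, 0.9, 1.0):
    print(f"gamma={gamma}: G_t = {discounted_return(rewards, gamma):.4f}")
```

With γ = 0 only the first reward counts, while with γ = 1 all five rewards count equally.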
- When γ = 0, the agent is short-sighted and focuses only on immediate rewards,
- when γ → 1, it becomes long-sighted, and the agent is planning for long-term benefits,
- in episodic environments, γ can safely be 1 because the episode ends,
- in continuous or infinite-horizon tasks, γ must be < 1 to keep the total return finite.
Before moving on to the next section, let’s better understand the list above.
If γ = 0, the agent does not see into the future. It is focused on the present.
If γ → 1, the agent focuses on a long-term plan.
So far everything should be clear. But the next two points have nuances that should be explained in more detail.
An episodic environment is an environment in which each episode has a finite duration and ends clearly at a specific time. For example, an episode of CartPole ends when the pole falls or after 500 steps. This means that the reward chain has a finite length (T). Therefore, we can set γ = 1 without the risk of divergence. When an episode is finite, the sum of the rewards is finite even without discounting.
However, even if it is mathematically safe, in practice γ = 1 can create numerical instability or large variations in Q values. For example, in noisy environments, γ = 1 amplifies future uncertainty. Therefore, in many RL applications (including episodic ones), a γ slightly below 1 is preferred (for example, 0.98 or 0.99) to stabilize learning and reduce the effect of noise.
A continuous (or infinite-horizon) environment is an environment that has no natural end. That is, the agent acts indefinitely. For example, a temperature controller that operates continuously. In such environments, there is no time T at which the episode stops, so the sum of future rewards can have infinitely many terms. Setting γ < 1 damps the contribution of distant rewards, which keeps the total return finite. Therefore, we use values such as γ = 0.95–0.99.
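The difference between the two cases can be checked numerically. The sketch below assumes a constant reward of +1 per step and compares the undiscounted sum with the discounted sum as the number of steps grows. The helper name `partial_return` is hypothetical.

```python
def partial_return(gamma, n_steps, reward=1.0):
    """Sum of reward * gamma^k for k = 0 .. n_steps - 1."""
    return sum(reward * gamma**k for k in range(n_steps))


for n_steps in (500, 5_000, 50_000):
    undiscounted = partial_return(1.0, n_steps)
    discounted = partial_return(0.99, n_steps)
    print(f"{n_steps:>6} steps: gamma=1.0 -> {undiscounted:.1f}, "
          f"gamma=0.99 -> {discounted:.4f}")

# For gamma < 1 the geometric series converges to reward / (1 - gamma).
print("limit for gamma=0.99:", 1.0 / (1.0 - 0.99))
```

With γ = 1 the sum keeps growing with the number of steps, which is harmless only when the episode is guaranteed to end. With γ = 0.99 it approaches a fixed limit of about 100.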
Where the Discount Factor Appears in the Q-Learning Update Rule
Q-Learning is an off-policy reinforcement learning algorithm. This means that it learns the value of the optimal (greedy) policy, even if during training the actions are chosen by an exploratory policy (ε-greedy).
Its goal is to learn a function Q(s,a) that estimates the “value” of an action in a state:
- how much total reward (return) the agent can get if it chooses that action and then follows an optimal policy.
Unlike model-based methods, Q-Learning is model-free. It does not need to know the dynamics of the environment (state transitions).
Q-Learning is a TD-learning (Temporal Difference) algorithm. It updates the values after each state transition using the Bellman equation as a “learning target”.
Q-Learning Formula
\[ Q(s, a) \leftarrow Q(s, a) + \alpha \cdot \left[ r + \gamma \cdot \max_{a'} Q(s', a') - Q(s, a) \right] \]
Where:
- Q(s,a): current Q value for the state-action pair
- α: learning rate
- γ: discount factor
- r: reward
- s′: next state
- a′: a candidate action in the next state
- max_{a′} Q(s′, a′): the maximum estimated value for the next state, according to the “greedy” principle of Q-Learning
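As a worked example of the update rule, the sketch below applies one update with different values of γ. The function name `q_update` and the numbers are illustrative. The `done` flag is an assumption that goes beyond the formula: it reflects the common convention of not bootstrapping from a terminal state.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma, done=False):
    """Apply one Q-Learning update and return the new value of Q(s, a)."""
    # Assumed convention: at a terminal transition there is no future value to add.
    target = reward if done else reward + gamma * max_q_next
    td_error = target - q_sa
    return q_sa + alpha * td_error


# Same transition, different discount factors (illustrative numbers).
for gamma in (0.1, 0.9, 0.99):
    new_q = q_update(q_sa=0.0, reward=1.0, max_q_next=2.0, alpha=0.1, gamma=gamma)
    print(f"gamma={gamma}: Q(s, a) -> {new_q:.3f}")
```

The only term that changes between the three runs is γ · max Q(s′, a′), so a larger γ pulls more of the next state's estimated value into the current estimate.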
Where does discount factor γ come into play in the learning process?

- Initialize environment: define the states, actions, rewards, and rules of the environment,
- initialize Q-table: create a matrix (or other structure) Q(s,a), usually with zero initial values,
- set α, γ, ε: set the hyper-parameters: learning rate α, discount factor γ, and exploration rate ε. This is where γ comes in,
- for each episode: repeat the learning process for a number of episodes,
- reset → s: Reset the environment to the initial state and set the current state s,
- choose action (ε-greedy): with probability ε you explore randomly, otherwise you choose the action with maximum Q(s,a),
- execute → (s′, r, done): the agent executes the action, receives the reward r, reaches a new state s′; check if the episode is over,
- update Q with TD rule -> see the above equation: here γ appears explicitly as a factor that “discounts” the future value,
- s ← s′: update the current state for the next step of the episode,
- decay ε (optional): reduce the exploration rate over time to favor long-term exploitation (not strictly necessary but very useful),
- end: terminate when you have enough episodes or when the agent converges.
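The steps above can be sketched as a compact training loop. To keep the example runnable with the standard library only, it uses a hypothetical five-state corridor (`ChainEnv`) instead of CartPole. The environment, the hyper-parameter defaults, and the function names are assumptions made for illustration, not the code used in the experiments.

```python
import random


class ChainEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, n_states=5, max_steps=100):
        self.n_states = n_states
        self.max_steps = max_steps

    def reset(self):
        self.state, self.steps = 0, 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; the left wall blocks movement.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        self.steps += 1
        terminal = self.state == self.n_states - 1
        reward = 1.0 if terminal else 0.0
        done = terminal or self.steps >= self.max_steps
        return self.state, reward, done, terminal


def train(gamma, alpha=0.1, episodes=500, eps_decay=0.99, eps_min=0.01, seed=0):
    rng = random.Random(seed)
    env = ChainEnv()
    q = [[0.0, 0.0] for _ in range(env.n_states)]  # initialize Q-table with zeros
    epsilon = 1.0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Choose action (epsilon-greedy).
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda act: q[s][act])
            s_next, r, done, terminal = env.step(a)
            # TD update: gamma discounts the estimated value of the next state.
            target = r if terminal else r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
        epsilon = max(eps_min, epsilon * eps_decay)  # decay epsilon
    return q


q_table = train(gamma=0.9)
for state, (left, right) in enumerate(q_table[:-1]):
    print(f"state {state}: Q(left)={left:.3f}  Q(right)={right:.3f}")
```

In this toy task the learned value of moving right shrinks by roughly a factor of γ for every extra step between the state and the goal, which is the discounting effect in its simplest form.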
Environment: CartPole + Q-Learning
For this tutorial, I use the CartPole-v1 environment from Gymnasium to demonstrate how the discount factor γ affects learning in Q-Learning.
CartPole is a simple, ready-to-use RL environment. The task is to push a moving cart left or right to keep the pole on it balanced.
Despite its simplicity, it captures the essence of control and balance problems. It is exactly the situation where the choice of γ makes a big difference.
The Q-Learning agent interacts with the CartPole environment step by step, learning a policy that maximizes the cumulative reward.
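A Q-table needs discrete states, while CartPole observations are four continuous numbers (cart position, cart velocity, pole angle, pole angular velocity). The text does not show which discretization was used, so the sketch below is only one possible binning. The clipping ranges, the bin counts, and the name `discretize` are all assumptions.

```python
# Assumed clipping ranges and bin counts, in the order:
# cart position, cart velocity, pole angle (rad), pole angular velocity.
LOWER = (-2.4, -3.0, -0.21, -3.5)
UPPER = (2.4, 3.0, 0.21, 3.5)
BINS = (6, 6, 12, 12)


def discretize(observation):
    """Map a continuous 4-value observation to a tuple of bin indices."""
    indices = []
    for value, low, high, n_bins in zip(observation, LOWER, UPPER, BINS):
        clipped = min(max(value, low), high)      # keep the value inside the range
        fraction = (clipped - low) / (high - low)  # position within the range, 0..1
        indices.append(min(int(fraction * n_bins), n_bins - 1))
    return tuple(indices)


print(discretize((0.0, 0.0, 0.0, 0.0)))     # a centered, upright state
print(discretize((0.5, -1.2, 0.05, 2.0)))   # an arbitrary example observation
```

The resulting tuple can be used directly as a dictionary key or as an index into a multi-dimensional Q-table. A coarser or finer binning changes how well a given γ works, which is one reason the hyper-parameters have to be tuned together.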
If you haven’t installed Gymnasium yet, or want to test your CartPole setup, you can follow this short tutorial: How to Install OpenAI Gymnasium in Windows and Launch Your First Python RL Environment
In the tutorial, I explain how to install the environment, verify your Python setup, and run the CartPole environment.
Step-by-Step Example: How the Discount Factor Affects Q-Learning
All experiments follow the same protocol: same environment, same α, same ε schedule, and same number of episodes (α = 0.1, ε-decay = 0.995, episodes = 5000, environment = CartPole-v1). The only variable is γ.
During training, I monitor the episode reward curve to see how learning evolves. A growing and stabilizing trend indicates that the agent is learning an effective policy. If the curve remains flat or oscillates heavily, it means the agent hasn’t yet found a consistent balance strategy.
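Raw episode rewards are noisy, so a trend is easier to judge on a smoothed curve. The sketch below shows one simple way to do that with a trailing moving average. The window size and the function name `moving_average` are assumptions, and the plots in this tutorial may have been smoothed differently.

```python
def moving_average(values, window=100):
    """Trailing mean over the most recent `window` values, one output per input."""
    smoothed, running_sum = [], 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]  # drop the value that left the window
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


# Illustrative data: a noisy reward sequence that alternates between two levels.
episode_rewards = [20.0, 60.0] * 5
print(moving_average(episode_rewards, window=4))
```

A smoothed curve that keeps rising indicates learning, while one that stays flat or swings widely matches the failure patterns described above.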
EXAMPLE 1: Discount Factor Gamma (γ) = {0.100}
For this example, the discount factor is fixed at γ = 0.100.
A low γ value reduces the contribution of future rewards. In this case, the agent prioritizes short-term gains.

As we can see in the plot, using a very small discount factor (γ = 0.100) makes the agent focus almost entirely on immediate rewards while ignoring long-term outcomes.
The episode reward increases quickly at the beginning, but afterward the learning process becomes unstable and inconsistent.
Around the middle of training, performance drops significantly and recovers later, but only partially. This happens because the agent fails to plan ahead. It learns local reactions that work for short-term balance but cannot maintain stability over time. The result is short-sighted, oscillating behavior, typical of agents with very low γ.
EXAMPLE 2: Discount Factor Gamma (γ) = {0.900}
With a discount factor of γ = 0.900, the agent should start to consider not only the immediate reward but also short-term future outcomes.

As seen in the plot, the reward grows steadily during the first 1,000 episodes and then stabilizes around 110–120.
The agent learned a reasonably good balancing strategy within the first 1,000 episodes, but the plateau indicates that it struggles to maintain stability over long horizons.
EXAMPLE 3: Discount Factor Gamma (γ) = {0.950}
With a discount factor of γ = 0.950, the agent should value future rewards more than before. This approach leads to a smoother and more stable learning curve.

The performance quickly rises during the first 750 episodes and then stabilizes around 110–120 rewards per episode.
Compared to γ = 0.9, the improvement is subtle but visible. The learning curve is steadier. The agent’s behavior becomes less reactive and more planned.
EXAMPLE 4: Discount Factor Gamma (γ) = {0.990}
With a discount factor of γ = 0.990, the agent should learn to plan further into the future and optimize long-term rewards.

Compared to smaller gamma values, this example shows how the reward increases steadily and becomes much smoother. After about 2,000 episodes, the learning curve stabilizes around 130–140 rewards per episode. The agent has learned a stable balancing strategy.
This behavior reflects the advantage of higher γ values. The agent learns more slowly but more consistently. It anticipates future consequences instead of reacting only to immediate outcomes.
EXAMPLE 5: Discount Factor Gamma (γ) = {0.995}
With a discount factor of γ = 0.995, the agent should give strong importance to long-term rewards.

At the beginning, the learning process is slower, but the improvement is steady and consistent.
The curve shows a smooth rise up to around 150–160 rewards per episode, indicating a stable and well-balanced policy.
Compared to smaller γ values, the agent becomes more patient and less reactive. It focuses on maintaining balance over a longer horizon rather than maximizing short-term gains.
This results in slower learning but more stable long-term performance, which is ideal for control tasks that require persistence and stability.
EXAMPLE 6: Discount Factor Gamma (γ) = {0.998}
With γ = 0.998, the agent should give almost equal importance to future and immediate rewards.

The learning curve starts slowly but then rises sharply, stabilizing around 180–200 rewards per episode.
This shows that the agent has learned to maintain balance for long periods. It is now focused on long-term stability rather than short-term gains.
This high γ value promotes patient, forward-looking decision-making — a key trait of robust control in real-world RL systems.
EXAMPLE 7: Discount Factor Gamma (γ) = {1.100}
When the discount factor exceeds 1.0, the learning process should become unstable and diverge.

In this experiment with γ = 1.100, the agent fails to converge. The rewards fluctuate wildly between high and low values, and the average reward stays around 50.
The reason is mathematical. A γ greater than 1 amplifies future rewards instead of discounting them. This causes the Q-values to grow without bound, breaking the balance between present and future estimation.
As a result, the agent never stabilizes its policy and keeps oscillating between random and inconsistent actions. This clearly shows why γ must always be ≤ 1 for convergence in Q-Learning and other value-based algorithms.
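The unbounded growth can be reproduced in isolation. The sketch below applies the Q-Learning update repeatedly to a single hypothetical state that always transitions back to itself with a reward of +1. The setup and the name `repeated_backup` are assumptions chosen to isolate the effect of γ, and this is not the CartPole experiment.

```python
def repeated_backup(gamma, reward=1.0, n_updates=200, alpha=0.1):
    """Repeat the TD update on one self-looping state and return the final Q-value."""
    q = 0.0
    for _ in range(n_updates):
        # The next state is the same state, so max Q(s', a') is q itself.
        q += alpha * (reward + gamma * q - q)
    return q


for n_updates in (100, 1_000, 2_000):
    stable = repeated_backup(0.9, n_updates=n_updates)
    unstable = repeated_backup(1.1, n_updates=n_updates)
    print(f"{n_updates:>5} updates: gamma=0.9 -> {stable:.3f}, gamma=1.1 -> {unstable:.3e}")
```

With γ = 0.9 the value settles at reward / (1 − γ) = 10. With γ = 1.1 each update multiplies the estimate by a factor greater than 1, so it keeps growing for as long as training continues.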
Comparative Analysis: How Different γ Values Affect Learning and Convergence

This comparative plot shows exactly what I wanted to demonstrate in this tutorial: γ controls the agent’s personality. It’s not just “a parameter”. It changes the agent’s learning speed, stability, and the maximum level it can reach in 5,000 episodes.
- Low γ (0.1) → short-sighted RL. Learns fast at the beginning, but never reaches high rewards. Good for tasks with immediate feedback, bad for balance/control.
- Medium γ (0.9–0.95) → fast and reasonably stable. These values reach a decent performance quickly. They are good when you have only a few episodes or when the environment is noisy.
- High γ (0.99–0.998) → slow but best final performance. These agents need more steps to learn, but they end up with the most stable policy. This is what you want for robotics / control.
- γ > 1.0 → instability / divergence. Even if some episodes look good, the average reward stays low. This is a good negative example.
γ = 0.9 performed better than γ = 0.95 not because “shorter future is better”, but because my hyperparameters (α, discretization, ε-decay) were better aligned with that time horizon. A higher γ does not automatically mean better learning. The rest of the pipeline must be aligned with it.
In this controlled experiment (same environment, same α, same ε, same number of episodes), the best overall performance was obtained with γ = 0.995–0.998. This confirms that higher discount factors are better suited for CartPole-v1 when the goal is long-term stability rather than fast early learning.
Practical Insights, Pitfalls & Debugging
How to Detect When γ Is Too Low
Symptom: fast early learning followed by stagnation.
Plot pattern: reward increases quickly in the first few hundred episodes, then stops improving.
Agent behavior: the pole oscillates left–right without stabilizing for long.
Explanation:
- A very small γ (e.g. 0.1 or 0.3) makes the agent ignore delayed rewards. It optimizes only the next move instead of planning ahead.
Fix:
- Increase γ gradually (0.7 → 0.9 → 0.95) and monitor the slope of the reward curve. If improvement slows down but stabilizes, you’re in the right range.
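One way to monitor the slope without looking at a plot is to compare the mean reward of the most recent episodes with the mean of the episodes before them. The sketch below is a rough heuristic: the window size, the `min_gain` threshold, and the name `has_plateaued` are all assumptions that would need tuning per environment.

```python
def has_plateaued(rewards, window=500, min_gain=1.0):
    """Return True if the last window improved by less than min_gain over the previous one."""
    if len(rewards) < 2 * window:
        return False  # not enough data to compare two full windows
    recent = sum(rewards[-window:]) / window
    previous = sum(rewards[-2 * window:-window]) / window
    return recent - previous < min_gain


# Illustrative data: a flat reward history versus a steadily improving one.
flat_history = [20.0] * 1000
rising_history = [float(i) for i in range(1000)]
print(has_plateaued(flat_history), has_plateaued(rising_history))
```

A plateau at a low reward level after a fast start is the pattern described above for a γ that is too low. A plateau at a high level simply means training has saturated.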
When γ Is Too High but ≤ 1.0
Symptom: slow start, late convergence, and very large Q-values.
Plot pattern: reward rises smoothly but only after many episodes (>2000).
Agent behavior: seems passive or overcautious early on, then suddenly becomes stable.
Explanation:
- Large γ values (0.99–0.998) propagate delayed rewards too far, so the agent updates slowly.
Fix:
- Use a smaller learning rate (α) to keep the updates stable,
- train longer by increasing the total number of episodes,
- regularly save Q-tables to see when the improvement saturates.
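Saving the Q-table at regular intervals can be done with the standard library alone. The sketch below assumes the Q-table is a dictionary that maps a tuple of integer state indices to a list of action values. That layout, the file naming, and the function names are assumptions, and the repository may store its tables differently.

```python
import json
from pathlib import Path


def save_q_table(q_table, directory, episode):
    """Write the Q-table to <directory>/q_table_ep<episode>.json and return the path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"q_table_ep{episode:05d}.json"
    # JSON keys must be strings, so each state tuple is stored as "i,j,k,...".
    serializable = {",".join(map(str, state)): values for state, values in q_table.items()}
    path.write_text(json.dumps(serializable))
    return path


def load_q_table(path):
    """Read a Q-table written by save_q_table back into a dict keyed by tuples."""
    raw = json.loads(Path(path).read_text())
    return {tuple(int(i) for i in key.split(",")): values for key, values in raw.items()}
```

Inside a training loop, a call such as `save_q_table(q_table, "checkpoints", episode)` every few hundred episodes is enough to compare snapshots later and see when the values stop changing.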
The Hidden Interaction: γ × α
If γ is large, α must be smaller — otherwise, updates overshoot and destabilize training.
If γ is small, α can be larger, because Q-values decay faster anyway.
As a rule of thumb:
- For γ ≈ 0.9 → α between 0.1–0.2 is fine.
- For γ ≥ 0.99 → α should be ≤ 0.05 for stability.
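A quick calculation helps explain why this interaction exists. With a reward of +1 per step, as in CartPole, the discounted return is bounded by 1 / (1 − γ), which is also a common estimate of how many future steps effectively matter. The sketch below tabulates that quantity for the γ values used in this tutorial. The name `effective_horizon` is illustrative.

```python
def effective_horizon(gamma):
    """Return 1 / (1 - gamma), valid only for gamma < 1."""
    if not gamma < 1.0:
        raise ValueError("effective horizon is only defined for gamma < 1")
    return 1.0 / (1.0 - gamma)


for gamma in (0.1, 0.9, 0.95, 0.99, 0.995, 0.998):
    horizon = effective_horizon(gamma)
    # With +1 reward per step, the same number is an upper bound on the Q-values.
    print(f"gamma={gamma}: horizon ~ {horizon:.1f} steps, Q upper bound ~ {horizon:.1f}")
```

Going from γ = 0.9 to γ = 0.998 raises the possible scale of the Q-values from about 10 to about 500. Since each update moves the estimate by α times the TD error, the same α produces much larger absolute jumps at high γ, which is consistent with the advice to lower α as γ grows.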
Conclusion
The discount factor γ is not just a mathematical term. It defines how an agent perceives time.
A wrong γ doesn’t just slow learning; it changes the personality of the agent.
The key to stable and efficient RL training is not finding the “perfect” γ, but aligning γ, α, ε, and environment characteristics into one reasonable learning dynamic.
Download & Reproduce the Experiments (GitHub Link)
Do you want to experiment with the discount factor? I made all the experiment files publicly available on GitHub. You can run the same setup, visualize the same results, and even push the boundaries further by changing the parameters.
You can download or clone the full project from my GitHub repository:
Q-Learning – Discount Factor Experiments (CartPole-v1)





