This page was last edited on 12 November 2025
Epsilon-greedy is an action-selection strategy that balances exploration (trying new actions) and exploitation (choosing the best-known action). At each step, it chooses between:
- the best-known action (exploitation), with high probability (1 − ε).
- a random action (exploration), with low probability (ε).
We use epsilon-greedy to avoid getting stuck in local optima. Without exploration, the agent tends to settle on the first few actions that look good and never discovers better ones.
With Epsilon-greedy, we ensure the agent keeps exploring the environment, especially in the early stages of training. It’s a simple but effective way to improve learning and generalization.
Epsilon-greedy equations
There are two key equations:
1. Action Selection:
![Rendered by QuickLaTeX.com \[ \hspace{5mm} \fbox{ \begin{array}{rl} \vspace{2mm} \\ \displaystyle a_t = \begin{cases} \text{random action}, & \text{with probability } \varepsilon \\ \arg\max\limits_{a} Q(s_t, a), & \text{with probability } 1 - \varepsilon \end{cases} \\ \vspace{2mm} \end{array} } \hspace{5mm} \]](https://www.reinforcementlearningpath.com/wp-content/ql-cache/quicklatex.com-d44a9eed904d95028223a096481ec5ac_l3.png)
2. Epsilon Decay (optional, during training):
![Rendered by QuickLaTeX.com \[ \hspace{5mm} \fbox{ \begin{array}{c} \vspace{5mm} \\ \displaystyle \varepsilon = \max \left( \varepsilon_{\text{min}}, \, \varepsilon_{\text{start}} \cdot \text{decay}^t \right) \\ \vspace{5mm} \end{array} } \hspace{5mm} \]](https://www.reinforcementlearningpath.com/wp-content/ql-cache/quicklatex.com-bb3281e50bb5bc4b0716c2827dec31b0_l3.png)
Where:
- ε: exploration rate.
- ε_start: initial exploration rate (e.g., 1.0).
- ε_min: minimum value (e.g., 0.01).
- decay: factor (e.g., 0.99).
- t: time step (or episode number).
- Q(s_t, a): current estimate of the value of taking action a in state s_t.
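The decay schedule in equation 2 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `decayed_epsilon` is hypothetical, and the defaults are simply the example values listed above.

```python
def decayed_epsilon(t, eps_start=1.0, eps_min=0.01, decay=0.99):
    """Exponential decay with a floor: max(eps_min, eps_start * decay**t).

    Hypothetical helper; the defaults mirror the example values in the text.
    """
    return max(eps_min, eps_start * decay ** t)


# Print epsilon at a few time steps to see the decay and the floor.
for t in (0, 1, 100, 1000):
    print(t, decayed_epsilon(t))
```

With these example values, ε starts at 1.0, shrinks by 1% per step, and reaches the floor of 0.01 after roughly 460 steps. A decay factor closer to 1 keeps the agent exploring for longer.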
ANALOGY
I’m in a new city and want to find the best restaurant.
- On the first day, I try random places (exploration).
- After a week, I have found a great one and keep going back there (exploitation).
But now and then, I still want to try a new one in case it’s better (epsilon-greedy).
So epsilon is like how adventurous I want to be:
- High epsilon = I want to try new places often.
- Low epsilon = I stick with my favorite restaurants.
Inputs and outputs of epsilon-greedy
Inputs:
- Current Q-values: Q(s, a).
- Epsilon ε.
- Action space A.
Output:
- Selected action a_t.
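These inputs and the output map naturally onto a small function. The sketch below is one possible way to write it, under a few assumptions that are not part of the article: Q-values for the current state are passed as a dict (so the action space is the dict's keys), exploration samples uniformly from all actions, and ties in the greedy step go to the first action listed. The name `epsilon_greedy` and the `rng` parameter are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action for the current state.

    q_values: dict mapping each action to its current estimate Q(s, a).
    epsilon:  exploration rate in [0, 1].
    rng:      source of randomness (assumption: anything with .random() and .choice()).
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        # Explore: uniform choice over the whole action space.
        return rng.choice(actions)
    # Exploit: highest Q-value; max() keeps the first action on ties.
    return max(actions, key=q_values.get)


# Usage: a seeded generator makes the run reproducible.
rng = random.Random(0)
q = {"a1": 5, "a2": 2, "a3": 0}
print([epsilon_greedy(q, epsilon=0.4, rng=rng) for _ in range(10)])
```

Passing the random generator in explicitly is a design choice that keeps the function easy to test: with ε = 0 it always returns the greedy action, and with ε = 1 it always explores.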
EXAMPLE: How epsilon-greedy works
Assumptions:
- Action space A={a1, a2, a3}
- Q-values:
  - Q(s, a1) = 5
  - Q(s, a2) = 2
  - Q(s, a3) = 0
- ε= 0.4 (fixed, no decay)
- The random values below are fixed, illustrative draws, so the walkthrough is reproducible.
At each iteration, we draw a random number uniformly from [0, 1) and compare it with ε:
- If random < ε → explore
- Else → exploit (choose action with highest Q-value)
ITERATION 1:
- Random = 0.25 → explore
- Choose randomly → let’s say a3
ITERATION 2:
- Random = 0.85 → exploit
- Choose arg max Q = a1
ITERATION 3:
- Random = 0.33 → explore
- Choose randomly → let’s say a2
ITERATION 4:
- Random = 0.92 → exploit
- Choose a1
ITERATION 5:
- Random = 0.10 → explore
- Choose randomly → let’s say a2
| Iteration | Random Value | Decision | Action Chosen |
|---|---|---|---|
| 1 | 0.25 | Explore | a3 |
| 2 | 0.85 | Exploit | a1 |
| 3 | 0.33 | Explore | a2 |
| 4 | 0.92 | Exploit | a1 |
| 5 | 0.10 | Explore | a2 |
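The table can be reproduced with a short script. To match the walkthrough exactly, this sketch hard-codes both the five random draws and the "let's say" exploratory picks from the iterations above; the variable names (`draws`, `explore_picks`, `rows`) are hypothetical, and a real agent would sample both at run time.

```python
q = {"a1": 5, "a2": 2, "a3": 0}
epsilon = 0.4

# Fixed values copied from the walkthrough (not sampled here).
draws = [0.25, 0.85, 0.33, 0.92, 0.10]
explore_picks = iter(["a3", "a2", "a2"])

rows = []
for i, u in enumerate(draws, start=1):
    if u < epsilon:
        # Explore: take the next illustrative random pick.
        decision, action = "Explore", next(explore_picks)
    else:
        # Exploit: action with the highest Q-value.
        decision, action = "Exploit", max(q, key=q.get)
    rows.append((i, u, decision, action))

for row in rows:
    print(row)
```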
Interpretation:
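- 3 out of 5 actions were exploratory (a3, a2, a2).
- 2 out of 5 were greedy (a1).
- With ε = 0.4, we expect about 40% exploration in the long run. Seeing 60% over only five draws is normal sampling noise; the fraction approaches 40% as the number of iterations grows.
- If exploration samples uniformly from all of A, an exploratory step can also land on a1, so a1 is chosen with overall probability 1 − ε + ε/|A| = 0.6 + 0.4/3 ≈ 0.73.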
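Five iterations are too few to judge the exploration rate, so it can help to check the long-run behaviour with a quick simulation. The sketch below assumes exploration samples uniformly from all three actions (so an exploratory step may still pick a1); the function name `simulate`, the seed, and the step count are arbitrary choices for illustration.

```python
import random
from collections import Counter


def simulate(q_values, epsilon, n_steps, seed=0):
    """Run epsilon-greedy n_steps times on fixed Q-values.

    Returns (fraction of exploratory steps, per-action selection frequency).
    """
    rng = random.Random(seed)
    actions = list(q_values)
    greedy = max(actions, key=q_values.get)
    explored = 0
    counts = Counter()
    for _ in range(n_steps):
        if rng.random() < epsilon:
            explored += 1
            counts[rng.choice(actions)] += 1  # uniform over all actions
        else:
            counts[greedy] += 1
    return explored / n_steps, {a: counts[a] / n_steps for a in actions}


explore_rate, freqs = simulate({"a1": 5, "a2": 2, "a3": 0}, epsilon=0.4, n_steps=100_000)
print(explore_rate, freqs)
```

Under these assumptions, the exploratory fraction should come out close to ε = 0.4, and a1 should be selected about 1 − ε + ε/3 ≈ 0.73 of the time, with a2 and a3 near ε/3 ≈ 0.13 each.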