Exploring or Exploiting? Understanding Epsilon-Greedy

This page was last edited on 12 November 2025

Epsilon-greedy is an action-selection strategy that balances exploration (trying new actions) and exploitation (choosing the best-known action). At each step it chooses between:

  • the best-known action (exploitation), with probability 1 − ε (usually high).
  • a random action (exploration), with probability ε (usually low).

We use epsilon-greedy to avoid getting stuck in local optima. Without exploration, the agent is likely to learn about only a few actions and never try better ones.

With epsilon-greedy, the agent keeps exploring the environment throughout training, and most of all in the early stages when ε is still high. It’s a simple but effective way to keep the value estimates from settling on a suboptimal action too early.

Epsilon-greedy equations

There are two key equations:

1. Action Selection:

    \[
    a_t =
    \begin{cases}
        \text{random action}, & \text{with probability } \varepsilon \\
        \arg\max\limits_{a} Q(s_t, a), & \text{with probability } 1 - \varepsilon
    \end{cases}
    \]

2. Epsilon Decay (optional, during training):

    \[
    \varepsilon = \max \left( \varepsilon_{\text{min}}, \, \varepsilon_{\text{start}} \cdot \text{decay}^{t} \right)
    \]

Where:

  • ε: exploration rate.
  • ε_start: initial exploration rate (e.g., 1.0).
  • ε_min: minimum exploration rate (e.g., 0.01).
  • decay: multiplicative decay factor (e.g., 0.99).
  • t: time step (or episode number).
  • Q(s_t, a): current estimate of the value of taking action a in state s_t.
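The decay equation can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `decayed_epsilon` and its parameter names are assumptions made for this example, and the default values are simply the example values listed above.

```python
def decayed_epsilon(t, eps_start=1.0, eps_min=0.01, decay=0.99):
    """Exploration rate at step t: max(eps_min, eps_start * decay**t).

    The name and defaults are illustrative assumptions, taken from the
    example values in the list above.
    """
    return max(eps_min, eps_start * decay ** t)


if __name__ == "__main__":
    # Print the exploration rate at a few steps to see how fast it shrinks.
    for t in (0, 100, 300, 459, 1000):
        print(f"t={t:4d}  epsilon={decayed_epsilon(t):.4f}")
```

With these example values, ε starts at 1.0 and reaches the floor ε_min = 0.01 at t = 459, since 0.99^t first drops below 0.01 when t exceeds ln(0.01) / ln(0.99) ≈ 458.2.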

ANALOGY

I’m in a new city and want to find the best restaurant.

  • On the first day, I try random places (exploration).
  • After a week, I have found a great one and keep going there (exploitation).

But now and then, I still try a new one in case it’s better (epsilon-greedy).
So epsilon is like how adventurous I want to be:

  • High epsilon = I want to try new places often.
  • Low epsilon = I stick with my favorite restaurants.

Inputs and outputs of epsilon-greedy

Inputs:

  • Current Q-values: Q(s, a).
  • Epsilon ε.
  • Action space A.

Output:

  • Selected action a_t.
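These inputs and the output map directly onto a small function. The sketch below assumes the Q-values for the current state are stored in a dictionary keyed by action, so the dictionary keys double as the action space A. The name `epsilon_greedy` and the optional `rng` parameter are choices made for this example only.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from q_values (dict: action -> Q(s, a)).

    With probability epsilon, return a uniformly random action (explore);
    otherwise return the action with the highest Q-value (exploit).
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)          # explore
    return max(actions, key=q_values.get)   # exploit


if __name__ == "__main__":
    q = {"a1": 5, "a2": 2, "a3": 0}
    rng = random.Random(42)  # fixed seed so repeated runs give the same picks
    print([epsilon_greedy(q, 0.4, rng) for _ in range(10)])
```

Passing a seeded `random.Random` instance keeps a run reproducible without touching the global random state.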

Assumptions:

  • Action space A = {a1, a2, a3}
  • Q-values:
    • Q(s, a1) = 5
    • Q(s, a2) = 2
    • Q(s, a3) = 0
  • ε = 0.4 (fixed, no decay)
  • Random seed fixed so the run is reproducible.

At each iteration we draw a random number in [0, 1) and compare it with ε to decide whether to explore or exploit:

  • If random < ε → explore
  • Else → exploit (choose action with highest Q-value)

ITERATION 1:

  • Random = 0.25 → explore
  • Choose randomly → let’s say a3

ITERATION 2:

  • Random = 0.85 → exploit
  • Choose argmax_a Q(s, a) = a1

ITERATION 3:

  • Random = 0.33 → explore
  • Choose randomly → let’s say a2

ITERATION 4:

  • Random = 0.92 → exploit
  • Choose a1

ITERATION 5:

  • Random = 0.10 → explore
  • Choose randomly → let’s say a2

Iteration   Random value   Decision   Action chosen
1           0.25           Explore    a3
2           0.85           Exploit    a1
3           0.33           Explore    a2
4           0.92           Exploit    a1
5           0.10           Explore    a2
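The explore/exploit decisions in the five iterations can be replayed with a short Python sketch. It hard-codes the random values from the example instead of drawing new ones, so it reproduces only the decision column. The exploratory picks (a3, a2, a2) were chosen by hand in the example and are not regenerated here. The names `decide` and `draws` are illustrative.

```python
Q = {"a1": 5, "a2": 2, "a3": 0}          # Q-values from the example
EPSILON = 0.4                            # fixed exploration rate
draws = [0.25, 0.85, 0.33, 0.92, 0.10]   # random values from the example


def decide(draw, epsilon=EPSILON):
    """Return 'explore' if the draw falls below epsilon, else 'exploit'."""
    return "explore" if draw < epsilon else "exploit"


decisions = [decide(d) for d in draws]
greedy_action = max(Q, key=Q.get)  # action taken on every exploit step

if __name__ == "__main__":
    for i, (draw, decision) in enumerate(zip(draws, decisions), start=1):
        action = greedy_action if decision == "exploit" else "random pick"
        print(f"iteration {i}: random={draw:.2f} -> {decision} ({action})")
```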

Interpretation:

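  • 3 out of 5 actions were exploratory (a3, a2, a2).
  • 2 out of 5 were greedy (a1).
  • The observed exploration rate (60%) is higher than the 40% expected for ε = 0.4. With only five draws, a deviation like this is normal; over many iterations the fraction of exploratory steps approaches ε.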
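Five draws are too few to estimate the exploration rate reliably. A longer simulated run shows where the fractions settle. The sketch below assumes that an exploratory step samples uniformly from all actions, including the greedy one. Under that assumption the greedy action is chosen with probability 1 − ε + ε/|A|, which is 0.6 + 0.4/3 ≈ 0.733 here. The function name `simulate` and the seed value are illustrative.

```python
import random


def simulate(n, epsilon, q_values, seed=0):
    """Run n epsilon-greedy steps; return (explore fraction, greedy fraction)."""
    rng = random.Random(seed)
    actions = list(q_values)
    greedy = max(actions, key=q_values.get)
    explored = 0
    chose_greedy = 0
    for _ in range(n):
        if rng.random() < epsilon:
            explored += 1
            action = rng.choice(actions)  # uniform over ALL actions (assumption)
        else:
            action = greedy
        if action == greedy:
            chose_greedy += 1
    return explored / n, chose_greedy / n


if __name__ == "__main__":
    q = {"a1": 5, "a2": 2, "a3": 0}
    for n in (5, 100, 100_000):
        explore_frac, greedy_frac = simulate(n, 0.4, q)
        print(f"n={n:6d}  explored={explore_frac:.3f}  greedy={greedy_frac:.3f}")
```

For small n the printed fractions can sit well away from 0.4, as in the five-step example; for large n they should be close to 0.4 and 0.733 respectively.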
