This page was last edited on 12 November 2025
ReLU stands for “Rectified Linear Unit.” It is a simple activation function used in deep neural networks, including those in Deep Reinforcement Learning (Deep RL).
ReLU decides whether an artificial neuron should be activated or not. If the input is positive, the activation function returns the input. If not, it returns 0.
ReLU adds non-linearity, which is crucial for learning complex patterns.
Why do we use ReLU in Deep RL?
ReLU is fast to compute and works well in practice. Its gradient is exactly 1 for positive inputs, which mitigates the vanishing gradients seen with sigmoid and tanh and helps deep networks learn faster.
Speed and stability matter in Deep RL applications, where agents need to learn from delayed and noisy feedback. ReLU passes positive signals through unchanged and zeroes out negative ones, so the network can focus on the signals that matter.
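To make the vanishing-gradient point concrete, here is a minimal sketch in plain Python. It assumes a toy chain of 10 layers with one unit each and ignores the weights, so the backpropagated gradient is simply the product of the activation derivatives. The function names, the depth, and the pre-activation value 0.5 are illustrative assumptions, not part of the original text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid; it never exceeds 0.25.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of max(0, x); the value at x == 0 is set to 0 by convention.
    return 1.0 if x > 0 else 0.0

def chained_gradient(grad_fn, pre_activations):
    # Product of the local derivatives along the chain (weights ignored).
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product

depth = 10
pre_activations = [0.5] * depth  # assumed positive pre-activation at every layer

sigmoid_chain = chained_gradient(sigmoid_grad, pre_activations)
relu_chain = chained_gradient(relu_grad, pre_activations)

print(f"sigmoid chain: {sigmoid_chain:.2e}")
print(f"ReLU chain:    {relu_chain:.2e}")
```

Because the sigmoid derivative is at most 0.25, the sigmoid product over 10 layers is below 0.25^10 (about 1e-6), while the ReLU product stays at 1. The flip side is that a negative pre-activation anywhere in the chain would make the ReLU product exactly 0, which is the dying-neuron problem.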
When Should You Use ReLU?
Use ReLU when:
- You need fast computation
- Your model is not extremely deep
- You’re building a Deep RL agent (e.g. DQN, PPO)
Avoid ReLU when:
- You see a lot of dead neurons (0 outputs)
- Your model needs smooth gradients
- You train very deep networks with sparse rewards
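One way to check the first "avoid" condition is to count the units that output 0 for every sample in a batch. The sketch below assumes the post-ReLU activations are available as a list of rows (one row per sample, one value per unit). `dead_unit_fraction` and the sample batch are hypothetical, made up for illustration.

```python
def dead_unit_fraction(activations):
    """Return the fraction of units that are 0 for every sample in the batch.

    activations: list of rows, one row per sample, one value per unit,
    taken after the ReLU has been applied.
    """
    if not activations:
        raise ValueError("activations must contain at least one sample")
    num_units = len(activations[0])
    dead = 0
    for unit in range(num_units):
        # A unit counts as dead if it never fires on any sample.
        if all(row[unit] == 0 for row in activations):
            dead += 1
    return dead / num_units

# Made-up batch: 3 samples, 4 units. The units at index 1 and 3 never fire.
batch = [
    [0.7, 0.0, 1.2, 0.0],
    [0.0, 0.0, 0.4, 0.0],
    [2.1, 0.0, 0.0, 0.0],
]
print(dead_unit_fraction(batch))  # 0.5
```

A unit that is silent on one batch is not necessarily dead for good, so a check like this is more informative when repeated over several batches during training.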
Comparing ReLU with Other Activation Functions
The goal here is to help readers understand when and why to use ReLU.
ReLU vs. Swish, GELU, ELU, SELU
| Function | Main Idea | Pros | Cons |
|---|---|---|---|
| ReLU | max(0, x) | Simple, fast, effective | Dying neurons, not smooth |
| Swish | x * sigmoid(x) | Smooth, often better than ReLU | Slightly slower |
| GELU | x * Φ(x), with Φ the standard Gaussian CDF | Used in transformers, smoother | More complex, not always better |
| ELU | Exponential for x < 0 | Avoids dead neurons | Slower, sensitive to params |
| SELU | Scaled ELU (self-normalizing) | Works well in deep nets (with dropout off) | Restrictive conditions |
- Swish and GELU perform better in some modern architectures but are more computationally expensive.
- ELU and SELU can prevent dying neurons and help with internal normalization, but they need careful tuning.
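The functions in the table can be written in a few lines each. The sketch below uses plain Python on scalar inputs and makes some assumptions: GELU is taken in its exact Gaussian-CDF form x * Φ(x), ELU uses alpha = 1.0 as a default, and SELU uses the commonly published constants for its scale and alpha.

```python
import math

def relu(x):
    return max(0.0, x)

def swish(x):
    # x * sigmoid(x), written as x / (1 + e^(-x))
    return x / (1.0 + math.exp(-x))

def gelu(x):
    # x * Phi(x), where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def elu(x, alpha=1.0):
    # Identity for positive x, scaled exponential for x <= 0
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

# Commonly published SELU constants (assumed here, not taken from the article).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    return SELU_SCALE * elu(x, SELU_ALPHA)

for x in (-2.0, 0.0, 2.0):
    row = [f(x) for f in (relu, swish, gelu, elu, selu)]
    print(x, [round(v, 4) for v in row])
```

For the negative input, only ReLU returns exactly 0; the other four return small negative values, which is why they keep a non-zero gradient there. The extra `exp` or `erf` call per input is the source of the added computational cost.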
ReLU activation function equation
The equation is:
\[ \text{ReLU}(x) = \max(0, x) \]
Where:
- x is the input value (can be from an artificial neuron, layer, or linear function)
- max picks the larger of 0 or x. If x>0, ReLU returns x. Otherwise, it returns 0.
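Written out piecewise, the same definition and its derivative (the quantity backpropagation uses) look as follows. The derivative is undefined at x = 0; setting it to 0 there is a common convention assumed here, not something the formula itself states.

```latex
\[
\text{ReLU}(x) = \max(0, x) =
\begin{cases}
x & \text{if } x > 0 \\
0 & \text{if } x \le 0
\end{cases}
\qquad
\text{ReLU}'(x) =
\begin{cases}
1 & \text{if } x > 0 \\
0 & \text{if } x \le 0 \quad \text{(by convention at } x = 0\text{)}
\end{cases}
\]
```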
ANALOGY
We can imagine a water pipe with a one-way valve. Water (input) flows only if the pressure (value) is positive. If there’s no pressure (negative or zero), the valve blocks it.
ReLU is that valve—it only lets positive signals pass through.
HISTORY
ReLU started gaining attention in 2010. It was popularized by the paper “Rectified Linear Units Improve Restricted Boltzmann Machines” by Nair and Hinton.
It outperformed sigmoid and tanh in deep networks. It was first widely used in computer vision tasks, then in RL with Deep Q-Networks (DQN) by DeepMind in 2015.
Steps to implement ReLU activation function
- Take the input x.
- Check if x>0.
- If yes, return x.
- If not, return 0.
- Apply this element-wise to all inputs in a layer.
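The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation; `relu_layer` is a hypothetical helper name, and the input values are made up.

```python
def relu(x):
    # First four steps: take x, check whether it is positive, return x or 0.
    if x > 0:
        return x
    return 0.0

def relu_layer(inputs):
    # Last step: apply the same rule to every input in a layer.
    return [relu(x) for x in inputs]

print(relu_layer([-3.0, -0.5, 0.0, 2.0]))  # [0.0, 0.0, 0.0, 2.0]
```

In practice, deep learning libraries provide vectorized versions of this operation (for example `numpy.maximum(0, x)` or `torch.nn.ReLU`), which are much faster than a Python loop on large arrays.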
EXAMPLE: how ReLU filters out negative signals
Initial Inputs (one input per iteration):
- Iteration 1: x= −4
- Iteration 2: x= −1.2
- Iteration 3: x= 0
- Iteration 4: x= 3.5
- Iteration 5: x= 6.8
ReLU Calculations:
- Iteration 1: ReLU(-4)= max(0, -4)= 0
- Iteration 2: ReLU(-1.2)= max(0, -1.2)= 0
- Iteration 3: ReLU(0)= max(0, 0)= 0
- Iteration 4: ReLU(3.5)= max(0, 3.5)= 3.5
- Iteration 5: ReLU(6.8)= max(0, 6.8)= 6.8
Table of Results:
| Iteration | Input (x) | ReLU(x) |
|---|---|---|
| 1 | -4 | 0 |
| 2 | -1.2 | 0 |
| 3 | 0 | 0 |
| 4 | 3.5 | 3.5 |
| 5 | 6.8 | 6.8 |

The first three iterations output 0 because their inputs were negative or zero.
The last two iterations output the input itself because ReLU lets positive values pass.
This shows how ReLU filters out negative signals while preserving positive ones. It’s simple but powerful in shaping neural network activations.
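The worked example can be reproduced with a short script. This sketch uses Python's built-in `max` directly, mirroring the equation, and prints the rows in the same layout as the results table; the variable names are illustrative.

```python
inputs = [-4, -1.2, 0, 3.5, 6.8]

print("| Iteration | Input (x) | ReLU(x) |")
print("|---|---|---|")
outputs = []
for i, x in enumerate(inputs, start=1):
    y = max(0, x)  # ReLU(x) = max(0, x)
    outputs.append(y)
    print(f"| {i} | {x} | {y} |")
```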
References:
- Glorot, X., Bordes, A., & Bengio, Y. (2011). “Deep Sparse Rectifier Neural Networks.”
- Nair, V., & Hinton, G.E. (2010). “Rectified Linear Units Improve Restricted Boltzmann Machines.”