This page was last edited on 12 November 2025
Gradient Descent is how we train most machine learning and deep learning models.
It’s a simple idea: we want to minimize a cost function (error), and we do it by taking small steps in the direction that reduces the error the most.
That direction is the opposite of the gradient, which is the slope of the function: the gradient points uphill, so we step against it.
Why it’s important
Without Gradient Descent, we wouldn’t be able to train neural networks. It’s the engine that drives learning.
We use it to:
- Train models to classify images, detect objects, predict values
- Learn the best weights for neural networks
- Fine-tune any model to reduce error
No matter the architecture (CNN, RNN, Transformer, etc.), some form of Gradient Descent is usually at work behind the scenes.
STEP 1: Gradient Descent formula
Gradient Descent is an update rule built on the derivative (gradient) of a cost function f(x). The gradient shows how the function changes around x, and stepping against it moves x in the direction that reduces the function.
\[ x_{\text{new}} = x_{\text{old}} - \alpha \cdot \nabla f(x) \]
Where:
- x_new: the updated value of x.
- x_old: the previous value of x.
- α: learning rate (a manually chosen hyperparameter).
- ∇f(x): gradient of the function (for a single variable, the derivative of f(x)), evaluated at x_old.
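The update rule can be sketched as a single function. This is a minimal illustration, not a library API: the name `gradient_descent_step` and its arguments are chosen here for the example.

```python
def gradient_descent_step(x_old, grad_f, alpha):
    """Apply one update: x_new = x_old - alpha * grad_f(x_old)."""
    return x_old - alpha * grad_f(x_old)


# Example with f(x) = x^2, whose derivative is 2x.
x_new = gradient_descent_step(5.0, lambda x: 2 * x, alpha=0.1)
print(x_new)
```

Passing the gradient in as a function keeps the step independent of any particular cost function.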
STEP 2: Applying the Gradient Descent formula
In this example, Gradient Descent uses the gradient to find the minimum of a simple function.
Suppose we want to minimize the function:
\[ f(x) = x^2 \]
We differentiate the function to obtain the gradient:
\[ \nabla f(x) = \frac{d}{dx} (x^2) = 2x \]
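A quick way to sanity-check a hand-derived gradient is to compare it with a finite-difference estimate. The sketch below assumes a central difference with a step `h` picked for illustration; the helper name `central_difference` is hypothetical.

```python
def f(x):
    return x ** 2


def grad_f(x):
    # Analytical derivative of x^2.
    return 2 * x


def central_difference(func, x, h=1e-6):
    """Approximate the derivative of func at x from two nearby evaluations."""
    return (func(x + h) - func(x - h)) / (2 * h)


for x in (5.0, -3.0, 0.5):
    print(x, grad_f(x), central_difference(f, x))
```

The two printed derivative values should agree closely at each point, which suggests that 2x is the right gradient to use in the update.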
STEP 3: Step-by-step manual calculation
Initial values:
- Starting point: x_0 = 5
- Learning rate: α = 0.1
ITERATION 1
Calculating the gradient at x_0 = 5:
\[ \nabla f(x_0) = 2(5) = 10 \]
Update x:
\[ x_1 = x_0 - \alpha \cdot \nabla f(x_0) \]
\[ x_1 = 5 - 0.1 \times 10 = 5 - 1 = 4 \]
ITERATION 2
Calculating the gradient at x_1 = 4 and updating x:
\[ \nabla f(x_1) = 2(4) = 8 \]
\[ x_2 = 4 - 0.1 \times 8 = 4 - 0.8 = 3.2 \]
ITERATION 3
Calculating the gradient at x_2 = 3.2 and updating x:
\[ \nabla f(x_2) = 2(3.2) = 6.4 \]
\[ x_3 = 3.2 - 0.1 \times 6.4 = 3.2 - 0.64 = 2.56 \]
ITERATION 4
Calculating the gradient at x_3 = 2.56 and updating x:
\[ \nabla f(x_3) = 2(2.56) = 5.12 \]
\[ x_4 = 2.56 - 0.1 \times 5.12 = 2.56 - 0.512 = 2.048 \]
ITERATION 5
Calculating the gradient at x_4 = 2.048 and updating x:
\[ \nabla f(x_4) = 2(2.048) = 4.096 \]
\[ x_5 = 2.048 - 0.1 \times 4.096 = 2.048 - 0.4096 = 1.6384 \]
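The five manual iterations can be reproduced with a short loop. This is a sketch using the same starting point and learning rate; the variable names are chosen for illustration. Without intermediate rounding, the fifth value of x comes out as 1.6384.

```python
def grad_f(x):
    # Derivative of f(x) = x^2.
    return 2 * x


x = 5.0        # starting point x_0
alpha = 0.1    # learning rate
history = [x]  # x_0, x_1, ..., x_5

for i in range(1, 6):
    g = grad_f(x)
    x = x - alpha * g
    history.append(x)
    print(f"iteration {i}: gradient = {g:.4f}, x = {x:.4f}")
```

Keeping the values in `history` makes it easy to plot or inspect the path afterwards.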
As we iterate, the values of x become smaller and smaller, approaching 0, where the function f(x) = x^2 attains its minimum. The gradient shrinks along with x, which means that the steps become smaller and the algorithm gradually converges towards the minimum.
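For this particular function the shrinking steps can be written in closed form. Substituting the gradient 2x into the update rule gives a constant factor per iteration (a short derivation specific to f(x) = x^2 with x_0 = 5 and α = 0.1):

```latex
\begin{aligned}
x_{n+1} &= x_n - \alpha \cdot 2 x_n = (1 - 2\alpha)\, x_n \\
x_n &= (1 - 2\alpha)^n \, x_0 = 5 \cdot 0.8^n \\
x_5 &= 5 \cdot 0.8^5 = 5 \cdot 0.32768 = 1.6384
\end{aligned}
```

Each iteration multiplies x by 0.8, so x approaches 0 geometrically but never reaches it exactly in a finite number of steps.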
The negative of the gradient tells us the direction of steepest descent. If the gradient is positive, we move left; if it is negative, we move right.
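The sign rule is easy to check on both sides of the minimum. The sketch below uses a hypothetical helper `step_for` that returns the change applied to x for f(x) = x^2.

```python
def grad_f(x):
    # Derivative of f(x) = x^2.
    return 2 * x


def step_for(x, alpha=0.1):
    """Change applied to x in one update: -alpha * gradient."""
    return -alpha * grad_f(x)


for x in (5.0, -5.0):
    step = step_for(x)
    direction = "left" if step < 0 else "right"
    print(f"x = {x:+.1f}, gradient = {grad_f(x):+.1f}, step = {step:+.1f} ({direction})")
```

From x = 5 the gradient is positive and the step is negative (left); from x = -5 the gradient is negative and the step is positive (right). Both moves head toward x = 0.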
The learning rate controls how big the step is.
This is the basic mechanism by which Gradient Descent works.
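To see how much the learning rate matters, the same five steps can be run with several values of α. This is a sketch for f(x) = x^2 only; the function name `run` and the chosen rates are illustrative assumptions.

```python
def run(alpha, x0=5.0, steps=5):
    """Run gradient descent on f(x) = x^2 and return the final x."""
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x  # the gradient of x^2 is 2x
    return x


for alpha in (0.01, 0.1, 0.5, 1.0, 1.1):
    print(f"alpha = {alpha}: x after 5 steps = {run(alpha):.4f}")
```

For this function, a very small rate barely moves x, a rate of 0.5 lands on the minimum in one step, a rate of 1.0 bounces between 5 and -5 without progress, and anything larger moves x further away at every step. Other functions have different thresholds, which is why the learning rate is usually tuned per problem.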
Below is the graph illustrating the five iterations of Gradient Descent applied to f(x) = x^2. Each red point represents an iteration, showing how x moves toward the minimum (x = 0).

Types of Gradient Descent
There are 3 main types. All do the same thing: minimize a cost function by updating the parameters — but they differ in how much data we use at each step.
- Batch Gradient Descent
We use the entire dataset to compute the gradient at each step. It’s accurate but slow, especially on large datasets. It also doesn’t allow real-time updates.
Use it when:
- The dataset is small
- You want stable updates
- Stochastic Gradient Descent (SGD)
We use just one sample at a time to update the parameters. It’s fast, noisy, and less stable — but that noise can help escape local minima. It updates frequently and allows for online learning.
Use it when:
- You want fast learning
- Real-time updates are important
- Mini-batch Gradient Descent
We split the data into small batches (e.g. 32, 64, 128 samples). It’s a balance between batch and SGD — faster than full batch, more stable than SGD. It’s the most popular version in deep learning.
Use it when:
- Training deep networks
- You need performance and stability
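The three variants can share one training loop in which only the batch size changes. The sketch below fits a single weight w on synthetic data y ≈ 3x; the data, the function names (`make_data`, `mse_gradient`, `train`) and the hyperparameters are all assumptions made for illustration.

```python
import random


def make_data(n=200, true_w=3.0, seed=0):
    """Synthetic data: y = true_w * x plus a little Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_w * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def mse_gradient(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over the given samples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, batch_size, alpha=0.1, epochs=200, seed=0):
    """Gradient descent where batch_size selects the variant."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)  # new sample order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            bx = [xs[i] for i in batch]
            by = [ys[i] for i in batch]
            w -= alpha * mse_gradient(w, bx, by)
    return w


xs, ys = make_data()
w_batch = train(xs, ys, batch_size=len(xs))  # batch: one update per epoch
w_sgd = train(xs, ys, batch_size=1)          # SGD: one update per sample
w_mini = train(xs, ys, batch_size=32)        # mini-batch: one update per 32 samples
print(w_batch, w_sgd, w_mini)
```

All three runs should end near the true weight of 3. The batch run makes the fewest updates, each using all of the data, while the SGD run makes the most updates, each based on a single noisy sample.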