Gradient Descent: Step by Step to Success

This page was last edited on 12 November 2025

Gradient Descent is how we train most machine learning and deep learning models.

It’s a simple idea: we want to minimize a cost function (error), and we do it by taking small steps in the direction that reduces the error the most.

That direction is the opposite of the gradient, which is the slope of the function: the gradient points uphill, so we step against it.

Why it’s important

Without Gradient Descent, we wouldn’t be able to train neural networks. It’s the engine that drives learning.

We use it to:

  • Train models to classify images, detect objects, predict values
  • Learn the best weights for neural networks
  • Fine-tune any model to reduce error

No matter the architecture (CNN, RNN, Transformer, etc.), some form of Gradient Descent is typically used behind the scenes.

STEP 1: Gradient Descent formula

Gradient Descent is an update rule built on the gradient (derivative) of a cost function f(x). The gradient shows which way f(x) increases, so each step moves x the opposite way to reduce the function.

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle x_{\text{new}} = x_{\text{old}} - \alpha \cdot \nabla f(x) \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Where:

  • x_new: the updated value of x.
  • x_old: the previous value of x.
  • α: learning rate (a manually chosen hyperparameter).
  • ∇f(x): gradient of the function (derivative of the function f(x)).
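The update rule above can be sketched as a single Python function. This is a minimal illustration, not a reference implementation: the name `gradient_descent_step` and the choice of f(x) = x^2 (with gradient 2x) as the example function are assumptions made for this sketch.

```python
def gradient_descent_step(x_old, grad, alpha):
    """Apply one update: x_new = x_old - alpha * grad(x_old)."""
    return x_old - alpha * grad(x_old)


# Illustrative example: f(x) = x^2 has gradient 2x.
x_new = gradient_descent_step(5.0, lambda x: 2 * x, alpha=0.1)
print(x_new)
```

Passing the gradient in as a function keeps the step independent of any particular cost function.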

STEP 2: Applying the Gradient Descent formula

In this example, we use Gradient Descent to find the minimum of a simple function.

Suppose we want to minimize the function:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle f(x) = x^2 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

We differentiate the function to obtain the gradient:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \nabla f(x) = \frac{d}{dx} (x^2) = 2x \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]
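One way to sanity-check a hand-derived gradient is to compare it with a numerical estimate. The sketch below uses a central finite difference; the helper names `analytic_grad` and `numerical_grad` and the step size `h` are illustrative assumptions.

```python
def f(x):
    return x ** 2


def analytic_grad(x):
    # Derivative obtained by hand: d/dx (x^2) = 2x
    return 2 * x


def numerical_grad(func, x, h=1e-6):
    # Central difference approximation of the derivative at x
    return (func(x + h) - func(x - h)) / (2 * h)


for x in (-3.0, 0.5, 5.0):
    print(x, analytic_grad(x), numerical_grad(f, x))
```

The two columns should agree up to small floating-point error, which suggests the derivative 2x is correct.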

STEP 3: Step-by-step manual calculation

Initial values:

  • Starting point: x0 = 5
  • Learning rate: α = 0.1

ITERATION 1

Calculating the gradient at x0 = 5

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \nabla f(x_0) = 2(5) = 10 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Update x:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle x_1 = x_0 - \alpha \cdot \nabla f(x_0) \\           \vspace{3mm} \\         \displaystyle x_1 = 5 - 0.1 \times 10 = 5 - 1 = 4 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]


ITERATION 2

Calculating the gradient at x1 = 4

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \nabla f(x_1) = 2(4) = 8 \\           \vspace{3mm} \\         \displaystyle x_2 = 4 - 0.1 \times 8 = 4 - 0.8 = 3.2 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]


ITERATION 3

Calculating the gradient at x2 = 3.2

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \nabla f(x_2) = 2(3.2) = 6.4 \\           \vspace{3mm} \\         \displaystyle x_3 = 3.2 - 0.1 \times 6.4 = 3.2 - 0.64 = 2.56 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]


ITERATION 4

Calculating the gradient at x3 = 2.56

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \nabla f(x_3) = 2(2.56) = 5.12 \\           \vspace{3mm} \\         \displaystyle x_4 = 2.56 - 0.1 \times 5.12 = 2.56 - 0.512 = 2.048 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]


ITERATION 5

Calculating the gradient at x4 = 2.048

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \nabla f(x_4) = 2(2.048) = 4.096 \\           \vspace{3mm} \\         \displaystyle x_5 = 2.048 - 0.1 \times 4.096 = 2.048 - 0.4096 = 1.6384 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]


As we iterate, the values of x become smaller and smaller, approaching 0, which is where f(x) = x^2 reaches its minimum. The gradient also shrinks, so the steps become smaller and the algorithm gradually converges towards the minimum.
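The five manual iterations can be reproduced with a short loop. This is a minimal sketch using the same starting point (5) and learning rate (0.1) as the worked example; the variable names are illustrative.

```python
def grad(x):
    # Gradient of f(x) = x^2
    return 2 * x


x = 5.0      # starting point
alpha = 0.1  # learning rate
history = [x]

for i in range(1, 6):
    g = grad(x)
    x = x - alpha * g  # gradient descent update
    history.append(x)
    print(f"iteration {i}: gradient = {g:.4f}, x = {x:.4f}")
```

Each update multiplies x by 1 - 2α = 0.8, so after five iterations x is 5 × 0.8^5 ≈ 1.6384.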

The gradient points in the direction of steepest ascent, so we step in the opposite direction. If the gradient is positive, we move left; if negative, we move right.

The learning rate controls how big the step is.
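To see the effect of the learning rate, the sketch below runs the same update on f(x) = x^2 with several values of α. The helper `run` and the specific values tried are assumptions chosen for illustration only.

```python
def run(alpha, x0=5.0, steps=20):
    """Run gradient descent on f(x) = x^2 and return the final x."""
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x  # gradient of x^2 is 2x
    return x


for alpha in (0.01, 0.1, 0.5, 1.1):
    print(f"alpha = {alpha}: x after 20 steps = {run(alpha)}")
```

For this function each step multiplies x by (1 - 2α). A small α such as 0.01 converges slowly, α = 0.5 lands on the minimum in one step, and α = 1.1 makes |x| grow at every step, so the iteration diverges.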

This is the basic mechanism by which Gradient Descent works.

Below is the graph illustrating the five iterations of Gradient Descent applied to f(x) = x^2. Each red point represents an iteration, showing how x moves toward the minimum (x = 0).

Gradient descent iteration from 1 to 5

Types of Gradient Descent

There are 3 main types. All do the same thing: minimize a cost function by updating the parameters — but they differ in how much data we use at each step.

  1. Batch Gradient Descent
    We use the entire dataset to compute the gradient at each step. It’s accurate but slow, especially on large datasets. It also doesn’t allow real-time updates.
    Use it when:
    • The dataset is small
    • You want stable updates

  2. Stochastic Gradient Descent (SGD)
    We use just one sample at a time to update the parameters. It’s fast, noisy, and less stable — but that noise can help escape local minima. It updates frequently and allows for online learning.
    Use it when:
    • You want fast learning
    • Real-time updates are important

  3. Mini-batch Gradient Descent
    We split the data into small batches (e.g. 32, 64, 128 samples). It’s a balance between batch and SGD — faster than full batch, more stable than SGD. It’s the most popular version in deep learning.
    Use it when:
    • Training deep networks
    • You need performance and stability
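The three variants differ only in how many samples feed each gradient computation, so a single training loop with a batch-size parameter can cover all of them. The sketch below is a hedged illustration: the one-parameter model y_hat = w * x, the mean squared error cost, the toy data, and the names `gradient` and `train` are all assumptions, not part of the original article.

```python
import random


def gradient(w, batch):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, batch_size, alpha=0.1, epochs=50, seed=0):
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new sample order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= alpha * gradient(w, batch)  # one parameter update per batch
    return w


# Hypothetical toy data generated from y = 3x
data = [(x / 10, 3 * x / 10) for x in range(1, 21)]

print("batch     :", train(data, batch_size=len(data)))  # whole dataset per step
print("stochastic:", train(data, batch_size=1))          # one sample per step
print("mini-batch:", train(data, batch_size=4))          # small batches
```

On this noiseless toy data all three settings approach w = 3. With real data, the smaller the batch, the noisier each individual update tends to be.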
