Environment + RL Model + Reward Function – Application 1

This page was last edited on 24 April 2025

This tutorial has six parts:

 PART 1: Overview of the tutorial 

 PART 2: Problem Definition 

 PART 3: Markov Decision Process (MDP)

 PART 4: Choosing the Algorithm (DQN) 

 PART 5: Environment + RL Model + Reward Function ← you are here

 PART 6: Training + Testing + Google Colab Access


This step brings the previous three steps together. It turns the theory into something simulated — something the agent can interact with.

At this step, we already know the agent, states, actions, and rewards. Now we use all the known data to simulate that process.

To simulate the application, we create a loop where:

  • the agent sees an image,
  • takes an action,
  • receives a reward,
  • and moves to the next image.
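The loop above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the tutorial's implementation: `ToyEnv` and `run_random_agent` are hypothetical names, the "image" is just a digit from 0 to 9, and the agent guesses at random instead of learning.

```python
import random


class ToyEnv:
    """Hypothetical stand-in environment: the 'image' is just a digit 0-9."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.digit = None

    def reset(self):
        # Show the first "image".
        self.digit = self.rng.randrange(10)
        return self.digit

    def step(self, action):
        # action: 1 = YES ("this is a 3"), 0 = NO.
        correct = int(self.digit == 3)
        reward = 1 if action == correct else -1
        # Move to the next "image".
        self.digit = self.rng.randrange(10)
        return self.digit, reward


def run_random_agent(env, num_steps, seed=0):
    """Run the see -> act -> reward -> next-image loop with random actions."""
    rng = random.Random(seed)
    state = env.reset()                      # the agent sees an image
    total_reward = 0
    for _ in range(num_steps):
        action = rng.randrange(2)            # takes an action
        state, reward = env.step(action)     # receives a reward, moves on
        total_reward += reward
    return total_reward


total = run_random_agent(ToyEnv(seed=1), num_steps=100)
print("total reward over 100 steps:", total)
```

A trained agent would replace the random choice with a decision based on the current state.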

Key Concepts and Principles

  • Encapsulation – the logic for state transitions, reward calculation, and label evaluation is hidden inside the environment.
  • Stochasticity – every call to step() returns a new, randomly drawn image, which helps prevent overfitting.
  • Clarity – a clean separation between agent and environment.
  • Extendability – later we can inject noise, clutter, or add multiple digits.

The Simulation Environment

A simulation environment is a controlled world where the agent learns. The simulated world should mimic a real or abstract system — in our case, a digital eye trying to recognize the digit “3”.

The agent interacts with the environment:

  • One step at a time.
  • No humans in the loop.
  • Pure exploration and learning.

In Deep Reinforcement Learning, the environment does 4 things:

  1. Presents a state (image)
  2. Accepts an action (YES/NO)
  3. Returns a reward (+1 or -1)
  4. Transitions to the next state

We follow the OpenAI Gym style — step() and reset() functions.

What Will Our Environment Do?

Let’s define the full loop.

reset()

  • Called at the start of each episode.
  • Picks a random image from MNIST.
  • Returns the image as the initial state.

step(action)

  • Takes the agent’s decision (YES or NO).
  • Compares it with the correct label.
  • Returns:
    • Reward: based on the reward function defined in Step 2.
    • Next image (new state).
    • Done flag: optional, can be False if training is infinite.
    • Info: optional, can return ground truth or accuracy for logging.
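The reset() and step() behaviour described above could look roughly like the sketch below. It assumes NumPy arrays of preprocessed images and binary labels are already available. The class name `Digit3Env`, the `seed` argument, and the placeholder data are illustrative choices, not part of the original tutorial. The return order (state, reward, done, info) follows the classic Gym convention.

```python
import numpy as np


class Digit3Env:
    """Gym-style sketch: each state is one image, each action is YES (1) or NO (0)."""

    def __init__(self, images, labels, seed=0):
        # images: array of shape (N, 1, 28, 28); labels: N values in {0, 1}
        self.images = images
        self.labels = labels
        self.rng = np.random.default_rng(seed)
        self.index = None

    def _sample(self):
        # Draw a random image index; this is the source of stochasticity.
        self.index = int(self.rng.integers(len(self.images)))
        return self.images[self.index]

    def reset(self):
        # Start of an episode: return a random image as the initial state.
        return self._sample()

    def step(self, action):
        # Compare the agent's decision with the true label of the current image.
        truth = int(self.labels[self.index])
        reward = 1.0 if action == truth else -1.0
        info = {"truth": truth}       # ground truth, useful for logging
        next_state = self._sample()   # move on to a new random image
        done = False                  # assumption: one continuing stream of images
        return next_state, reward, done, info


# Usage with placeholder data (all-zero images, not real MNIST).
images = np.zeros((5, 1, 28, 28), dtype=np.float32)
labels = np.array([0, 1, 0, 0, 1])
env = Digit3Env(images, labels, seed=0)
state = env.reset()
next_state, reward, done, info = env.step(action=1)
```

In this sketch reset() must be called before the first step(), because step() scores the image that is currently on display.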

What is the Dataset Setup?

We use the MNIST dataset:

  • 60,000 images for training
  • 10,000 for testing
  • all digits 0–9 included

We preprocess images:

  • Apply Z-score standardization (subtract the mean, divide by the standard deviation) rather than simple [0,1] normalization
  • Keep original shape: 1 x 28 x 28 (grayscale)
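As a rough sketch of this preprocessing step, the function below standardizes a batch of images with NumPy. It assumes the raw images arrive as uint8 arrays of shape (N, 28, 28). The function name `zscore_standardize` and the random stand-in data are hypothetical.

```python
import numpy as np


def zscore_standardize(images_uint8):
    """Scale uint8 pixels to floats, then subtract the mean and divide by the std."""
    x = images_uint8.astype(np.float32) / 255.0
    mean = x.mean()
    std = x.std()
    x = (x - mean) / (std + 1e-8)   # small epsilon guards against division by zero
    # Add the channel axis so each image has shape 1 x 28 x 28.
    return x.reshape(-1, 1, 28, 28), mean, std


# Random stand-in data in place of real MNIST images.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(16, 28, 28), dtype=np.uint8)
standardized, mean, std = zscore_standardize(fake_images)
print(standardized.shape, float(standardized.mean()), float(standardized.std()))
```

The function returns the mean and standard deviation so that the statistics computed on the training images can be reused on the test images, which is the usual practice.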

We label images as:

  • 1 if it contains digit 3
  • 0 otherwise
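The relabeling is a one-line comparison. The sketch below assumes the original class labels are held in a NumPy array; the example digits are made up.

```python
import numpy as np

digits = np.array([3, 7, 0, 3, 9, 1])        # original class labels (made-up sample)
is_three = (digits == 3).astype(np.int64)    # 1 if the digit is 3, else 0
print(is_three)  # [1 0 0 1 0 0]
```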

We split:

  • 90% → training (used during simulation)
  • 10% → testing (used during evaluation only)
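One simple way to produce such a split is to shuffle the sample indices once and cut them at 90%. This is a sketch of that idea only. The helper name `split_indices`, the fixed seed, and the sample count are assumptions, and the tutorial's own split may be done differently.

```python
import numpy as np


def split_indices(num_samples, train_fraction=0.9, seed=0):
    """Shuffle indices once, then cut them into a training and a testing part."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_samples)
    cut = int(num_samples * train_fraction)
    return order[:cut], order[cut:]


train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))  # 900 100
```

The two index arrays never overlap, so no test image is seen during simulation.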

This is the step where we build the artificial brain of the agent.

Until now, we’ve defined the task, the environment, and how learning works. Now we define how the agent actually learns to choose actions.

We build the model that maps:

images → Q-values → actions

The purpose of the model is to estimate Q(s, a): the expected cumulative reward for taking action a in state s.

In our case:

  • s = grayscale image (28×28)
  • a = YES (1) or NO (0)
  • Q(s, a) = expected reward for each of the two actions

We use a neural network to approximate this Q-function. This is what makes Deep Q-Learning “deep”.

The CNN learns to extract visual features from raw pixels — no manual feature engineering.

The RL Model

Images have spatial structure. Pixels are not independent — nearby pixels form shapes and digits.

A CNN is the right tool:

  • Local filters detect edges, curves, corners
  • Layers learn hierarchical features
  • Final layers learn to map features to actions

We use a small and fast CNN — enough to learn digit features, but simple enough to train quickly.

Model Architecture

Input:

  • Shape: 1 x 28 x 28
  • Type: grayscale image
  • Z-score standardization

Layers:

  • Conv2D → ReLU  
  • Conv2D → ReLU  
  • Flatten  
  • Fully Connected → Output (2 units)

Output:

  • 2 Q-values: one for each action:
    • Q(s, NO)
    • Q(s, YES)
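The architecture above lists only the layer types, so the sketch below fills in the gaps with assumed values: 16 and 32 filters, 3x3 kernels, stride 2, and padding 1 are illustrative choices, not the tutorial's settings. It uses PyTorch, and `QNetwork` is a hypothetical class name.

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Small CNN mapping a 1 x 28 x 28 image to two Q-values: [Q(s, NO), Q(s, YES)]."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # -> 16 x 14 x 14
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # -> 32 x 7 x 7
            nn.ReLU(),
            nn.Flatten(),
        )
        # No activation after this layer: the outputs are raw Q-values.
        self.head = nn.Linear(32 * 7 * 7, 2)

    def forward(self, x):
        return self.head(self.features(x))


model = QNetwork()
batch = torch.zeros(4, 1, 28, 28)    # placeholder batch of four images
q_values = model(batch)
print(tuple(q_values.shape))  # (4, 2)
```

Index 0 of the output corresponds to NO and index 1 to YES, matching the action encoding used earlier.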

The action with the highest Q-value is selected during exploitation.
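Picking the highest Q-value covers exploitation. The sketch below pairs it with an epsilon-greedy rule for exploration, which is a common choice in DQN but an assumption here, since this part does not say how exploration is done. `select_action` and `epsilon` are hypothetical names.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))   # exploration: random action
    # Exploitation: index of the highest Q-value (0 = NO, 1 = YES).
    return max(range(len(q_values)), key=lambda a: q_values[a])


print(select_action([0.2, 0.9], epsilon=0.0))  # 1
```

With epsilon set to 0 the rule is purely greedy, which is the behaviour you would want at test time.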

Activation and Loss

Activations

  • Use ReLU in hidden layers → fast, non-linear, sparse.
  • No activation in output → we want raw Q-values.

Loss Function

  • Use Mean Squared Error (MSE) between predicted Q and target Q.
  • This is typical for value-based RL.

Optimizer and Training

Use Adam as the optimizer. It handles noisy gradients well. Learning rate: start with 1e-4 or 1e-3.

Training happens using TD updates:

  • Sample batch from replay memory
  • Compute target Q using Bellman equation
  • Minimize the difference between:
    • Q(s, a) predicted by the model
    • Target Q(s, a) = reward + γ * max over a’ of Q(s’, a’)
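One TD update might look like the PyTorch sketch below. Several pieces are assumptions made so the example runs on its own: a tiny linear network stands in for the CNN, a random batch stands in for a sample from replay memory, γ is set to 0.99, and the same network produces both the prediction and the target (many DQN implementations use a separate target network instead).

```python
import torch
import torch.nn as nn

gamma = 0.99  # assumed discount factor

# Stand-in Q-network so the sketch is self-contained (the tutorial uses a CNN).
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Fake batch standing in for a sample from replay memory.
batch_size = 8
states = torch.randn(batch_size, 1, 28, 28)
actions = torch.randint(0, 2, (batch_size,))                   # 0 = NO, 1 = YES
rewards = torch.randint(0, 2, (batch_size,)).float() * 2 - 1   # +1 or -1
next_states = torch.randn(batch_size, 1, 28, 28)

# Q(s, a) predicted by the model for the actions that were actually taken.
q_pred = model(states).gather(1, actions.unsqueeze(1)).squeeze(1)

# Target: reward + gamma * max over a' of Q(s', a'); no gradient flows through it.
with torch.no_grad():
    q_next = model(next_states).max(dim=1).values
    q_target = rewards + gamma * q_next

# Minimize the squared difference between prediction and target.
loss = loss_fn(q_pred, q_target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("loss:", loss.item())
```

Because the done flag is always False in this setup, the sketch does not mask the bootstrap term for terminal states.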

This is the moment when we teach the agent what’s good and what’s bad.

The reward function is the only feedback the agent sees. It tells the agent if its last action was right or wrong.

In Deep RL, rewards are not optional — they are the fuel for learning. Without clear, consistent rewards, the agent will fail. With the right feedback, it will learn faster and better.

Key Concepts

  • Immediate reward — no delay, the agent learns instantly after every step.
  • Binary scoring — perfect for discrete classification tasks.
  • Dense feedback — every image provides a learning opportunity.
  • Simplicity — no extra weights or thresholds to tune.

The Reward Function

Reinforcement Learning is not supervised. There are no labels directly telling the agent what to do. Instead, the agent must discover a policy through trial and error.

That’s why rewards must be:

  • Simple
  • Immediate
  • Aligned with the goal

In our case, the goal is to recognize the digit 3. So the reward function must punish the agent when it misses a 3, and reward it when it finds one.

The logic is binary and symmetric:

  • Correct prediction → +1
  • Wrong prediction → –1

This works well because:

  • It creates a dense reward: every step gives feedback.
  • It avoids sparse feedback (which slows learning).
  • It aligns perfectly with the binary nature of the task.

No partial rewards.
No probabilistic scoring.
Just win or lose — one decision at a time.

We reward the agent when its action matches the true label. And we penalize it when it makes a mistake.
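Written as code, the rule is a single comparison. In this sketch the function name `reward` and the 0/1 encoding of the true label are illustrative choices.

```python
def reward(action, is_three):
    """Return +1 when the action matches the true label, -1 otherwise."""
    return 1 if action == is_three else -1


# All four cases: (action, true label) -> reward
for action, label in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    print(action, label, reward(action, label))
```

The first two cases (a correct YES and a correct NO) return +1, and the last two (a false alarm and a missed 3) return -1.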

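This reward structure treats both classes evenly:

  • A correct YES and a correct NO earn the same +1.
  • The feedback only depends on matching the truth.

It does not rebalance the data, though. Only about one in ten MNIST images is a 3, so an agent that always answers NO would still collect a positive average reward. That is why accuracy alone is not enough to judge the agent.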

Now the system is ready. The environment is in place. The model is built. The reward function is set.

In Part 6, we bring everything together and train the agent.

You’ll see:

→ how the agent explores,
→ how it learns from feedback,
→ and how performance is measured.

We’ll use accuracy, precision, recall, and F1 score to track progress.
You’ll also learn how to test the trained model and visualize the results.
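The four metrics follow directly from the counts of true and false positives and negatives. The sketch below uses plain Python with made-up labels. `binary_metrics` is a hypothetical helper, and Part 6 may compute these values differently.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels (1 = digit 3)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 0, 0, 1, 0, 0, 0, 1]   # made-up ground truth
y_pred = [1, 0, 1, 0, 0, 0, 0, 1]   # made-up agent decisions
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this example accuracy is 0.75, while precision, recall, and F1 are each 2/3, which shows how the extra metrics expose mistakes on the rare positive class.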

All of this runs inside Google Colab — fast, free, and shareable.

