This page was last edited on 12 November 2025
Reward shaping is a technique in reinforcement learning (RL) where we modify the original reward signal to guide the agent more effectively.
It’s like adding “hints” to help the agent learn faster and more efficiently.
Why do we use reward shaping in Deep RL?
In Deep RL, the reward signal can often be sparse or delayed. Without guidance, the agent may take a long time to learn.
Reward shaping helps by providing extra feedback along the way, which can shorten exploration and improve sample efficiency.
When Should We Use Reward Shaping?
Use it when the agent is lost.
If our agent receives reward only at the end of the episode, it has no clue what it’s doing in the middle. That’s like telling someone “you passed the exam” without saying which answers were right.
Reward shaping gives feedback during the process.
Use it when:
- The reward is sparse (only at the end).
- The task has a long horizon (many steps until a result).
- The agent takes random actions for too long.
- The base reward doesn’t reflect progress toward the goal.
When Should We Avoid Reward Shaping?
Reward shaping can help — or ruin our training.
Use it with care.
Avoid reward shaping when:
- We don’t fully understand the task dynamics.
- We create heuristics that conflict with the true objective.
- We give too much feedback and the agent learns to exploit it.
- We risk reward hacking — where the agent finds tricks to maximize shaped rewards without solving the real task.
Example of reward hacking:
We give +1 whenever a robot moves closer to the ball. It starts circling the ball forever, drifting away and closing in again to keep collecting the bonus, and never kicks it. It maximizes the shaped reward instead of solving the game.
Also avoid shaping when the base reward is already dense and informative. Shaping should guide learning — not replace it.
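To make the loophole concrete, here is a minimal sketch comparing two bonuses on a made-up trajectory in which the robot steps toward the ball and then away, over and over. The function names, the distances, and the choice of potential Φ(s) = −distance are all assumptions for illustration. The naive "+1 for getting closer" bonus keeps paying out, while a bonus computed as a difference of potentials (with γ = 1 here) cancels over every loop.

```python
def naive_bonus(dist_before, dist_after):
    """+1 whenever the step ends closer to the ball, otherwise 0."""
    return 1.0 if dist_after < dist_before else 0.0


def potential_bonus(dist_before, dist_after, gamma=1.0):
    """Difference-of-potentials bonus with the assumed potential Phi(s) = -distance."""
    return gamma * (-dist_after) - (-dist_before)


# Hypothetical distances to the ball: approach, retreat, approach, retreat, ...
distances = [2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]
steps = list(zip(distances, distances[1:]))

naive_total = sum(naive_bonus(before, after) for before, after in steps)
potential_total = sum(potential_bonus(before, after) for before, after in steps)

print("naive bonus collected:    ", naive_total)
print("potential bonus collected:", potential_total)
```

On this trajectory the naive bonus rewards every approach and ignores every retreat, so looping is profitable. The potential-based bonus takes back on the retreat exactly what it paid on the approach.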
ANALOGY
Imagine training a dog to sit. We give a treat only after it fully sits (sparse reward).
With reward shaping, we also praise it when it starts to lower its body (dense feedback). It learns faster because it knows it’s on the right path.
HISTORY
Reward shaping was formalized in 1999 by Andrew Ng, Daishi Harada, and Stuart Russell. Paper: “Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping”
They introduced potential-based reward shaping, a way of building the shaping term from a potential function Φ over states that is guaranteed to leave the optimal policy unchanged. It remains a standard tool in reward shaping for reinforcement learning.
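For reference, the potential-based shaping term and a short telescoping argument can be written as below. The trajectory notation (states s_t, actions a_t, horizon T, discount factor γ) is introduced here for illustration and is not part of the text above.

```latex
\begin{align*}
F(s, a, s') &= \gamma \, \Phi(s') - \Phi(s) \\[4pt]
\sum_{t=0}^{T-1} \gamma^{t} F(s_t, a_t, s_{t+1})
  &= \sum_{t=0}^{T-1} \left( \gamma^{t+1} \Phi(s_{t+1}) - \gamma^{t} \Phi(s_t) \right) \\
  &= \gamma^{T} \Phi(s_T) - \Phi(s_0)
\end{align*}
```

The intermediate potentials cancel, so the total discounted bonus depends only on where the trajectory starts and ends, not on the path taken in between. This is only an informal intuition for why the optimal policy is unaffected; the paper gives the full proof.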
Steps for implementing reward shaping
- Define your base environment reward R(s, a, s’)
- Identify learning bottlenecks (e.g., sparse rewards, long horizons)
- Design shaping function F(s, a, s’), which can be heuristic or potential-based
- Check whether the shaping preserves the optimal policy (potential-based shaping does; an arbitrary heuristic may not)
- Apply the shaped reward R’ = R + F
- Train the agent using R’
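The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a hypothetical 1-D corridor where states are integer positions, the goal sits at position 10, and Φ(s) is the negative distance to the goal. The names `GOAL`, `GAMMA`, `potential`, and so on are made up for this example.

```python
GAMMA = 0.99  # assumed discount factor
GOAL = 10     # hypothetical goal position in a 1-D corridor


def base_reward(s, a, s_next):
    """Base reward R(s, a, s'): sparse, 1 only when the goal is reached."""
    return 1.0 if s_next == GOAL else 0.0


def potential(s):
    """Assumed potential Phi(s): negative distance to the goal."""
    return -float(abs(GOAL - s))


def shaping(s, a, s_next, gamma=GAMMA):
    """Shaping function F(s, a, s') = gamma * Phi(s') - Phi(s).

    Because F is a difference of potentials, it is potential-based
    by construction, which covers the policy-invariance check.
    """
    return gamma * potential(s_next) - potential(s)


def shaped_reward(s, a, s_next):
    """Shaped reward R' = R + F, the signal the agent is trained on."""
    return base_reward(s, a, s_next) + shaping(s, a, s_next)


# Two single steps from position 3: one toward the goal, one away from it.
toward = shaped_reward(3, +1, 4)
away = shaped_reward(3, -1, 2)
print("step toward goal:", toward)
print("step away from goal:", away)
```

With this potential, a step toward the goal gets a positive shaped reward and a step away gets a negative one, even though the base reward is zero in both cases.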
Inputs and outputs of a reward shaping module
Inputs:
- State s
- Action a
- Next state s’
- Base reward R(s, a, s’)
- Optional: potential function Φ(s)
Output:
- Shaped reward R’(s, a, s’)
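A module with these inputs and outputs might look like the sketch below. `RewardShaper` and its field names are hypothetical and do not come from any library. When no potential function is supplied, the module passes the base reward through unchanged.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class RewardShaper:
    """Hypothetical shaping module: (s, a, s', R) in, shaped reward R' out."""

    gamma: float = 0.99                                # assumed discount factor
    potential: Optional[Callable[[Any], float]] = None  # optional Phi(s)

    def __call__(self, s, a, s_next, base_reward):
        # Without a potential function there is nothing to add.
        if self.potential is None:
            return base_reward
        # The action `a` is accepted to match the interface, but the
        # potential-based term only looks at s and s'.
        bonus = self.gamma * self.potential(s_next) - self.potential(s)
        return base_reward + bonus


# Usage with a toy potential that simply returns the state as a number.
shaper = RewardShaper(gamma=0.9, potential=lambda s: float(s))
shaped = shaper(1, 0, 2, 0.5)

passthrough = RewardShaper()(1, 0, 2, 0.5)
print(shaped, passthrough)
```

Keeping the shaper separate from the environment makes it easy to switch shaping off and compare training runs against the unshaped base reward.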
References:
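- Ng, A.Y., Harada, D., & Russell, S. (1999). Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping. Proceedings of the Sixteenth International Conference on Machine Learning (ICML).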