Definition

Reward shaping is the practice of augmenting a reinforcement learning (RL) agent's reward signal with additional intermediate rewards that guide learning toward the desired behavior. In robotics, the natural reward for most tasks is sparse: the robot receives a +1 only when the task is fully completed (e.g., the object reaches its target location) and 0 otherwise. With sparse rewards, the agent must stumble upon success through random exploration before it can learn anything, which is impractical for complex manipulation tasks that may require hundreds of sequential actions.

Reward shaping addresses this by providing a denser signal — small rewards for making progress toward the goal. For example, a shaped reward for a pick-and-place task might include a term proportional to the negative distance between the gripper and the target object, encouraging the agent to approach the object even before it learns to grasp. Well-designed shaped rewards can reduce training time from billions of environment steps to millions, making RL feasible for real-world robotics.

How It Works

The shaped reward R'(s, a, s') augments the original environment reward R(s, a, s') with an additional shaping function F:

R'(s, a, s') = R(s, a, s') + F(s, a, s')

The shaping function F encodes domain knowledge about what constitutes "progress." In a reaching task, F might be the decrease in distance to the target. In an assembly task, F might reward alignment of parts, establishment of contact, or insertion depth. The key challenge is designing F so that maximizing R' also maximizes R — that is, the shaped reward does not introduce incentives that conflict with the true objective.
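As a concrete sketch of the formula above, the snippet below implements a shaped reward for a pick-and-place-style task, with F defined as the per-step decrease in gripper-to-object distance. The function names, tolerance, and scale factor are illustrative assumptions, not part of any particular simulator's API.

```python
import numpy as np

def sparse_reward(obj_pos, target_pos, tol=0.02):
    """R: +1 only when the object is within `tol` meters of the target."""
    return 1.0 if np.linalg.norm(obj_pos - target_pos) < tol else 0.0

def shaping_term(prev_gripper_pos, gripper_pos, obj_pos, scale=1.0):
    """F: reward the *decrease* in gripper-to-object distance this step."""
    prev_d = np.linalg.norm(prev_gripper_pos - obj_pos)
    d = np.linalg.norm(gripper_pos - obj_pos)
    return scale * (prev_d - d)

def shaped_reward(prev_gripper_pos, gripper_pos, obj_pos, target_pos):
    """R'(s, a, s') = R(s, a, s') + F(s, a, s')."""
    return sparse_reward(obj_pos, target_pos) + shaping_term(
        prev_gripper_pos, gripper_pos, obj_pos)
```

Because F rewards progress rather than raw proximity, the agent receives positive signal only when it actually moves the gripper closer, which provides gradient in states where the sparse term is still zero.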

During training, the RL algorithm (PPO, SAC, or TD3 are common choices for robotic manipulation) uses R' as its optimization target. The shaped terms provide gradient signal in states where the sparse reward is zero, enabling the policy to improve long before it achieves full task success.

Types of Reward Shaping

  • Potential-based reward shaping (PBRS) — Defines F(s, s') = γΦ(s') - Φ(s), where Φ is a potential function over states and γ is the discount factor. Ng et al. (1999) proved that PBRS preserves the optimal policy of the original MDP — the agent cannot be misled by the shaping. The potential function typically encodes distance to the goal or task progress.
  • Distance-based shaping — Rewards the per-step decrease in distance between relevant objects (gripper-to-object, object-to-target). Simple and widely used, but not guaranteed to preserve the optimal policy unless cast as potential-based shaping.
  • Demonstration-guided shaping — Uses expert demonstrations to define the reward: states similar to demonstrated states receive higher rewards. This bridges RL and imitation learning, using demonstrations to shape exploration rather than directly cloning behavior.
  • Curriculum-based shaping — Progressively adjusts the difficulty or the shaping magnitude during training. Early in training, strong shaping guides exploration; later, shaping is reduced so the agent optimizes the true objective.
  • Language-model-generated rewards — An emerging approach where LLMs generate reward function code from natural language task descriptions. The LLM translates "pick up the red cup" into a Python function computing gripper-cup distance, grasp closure, and lift height rewards. Eureka (2023) and Text2Reward demonstrate this approach.
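The PBRS definition in the first bullet can be sketched in a few lines. Here Φ is assumed (for illustration) to be the negative distance to a goal position; the goal coordinates and discount factor are placeholder values. The key property is that the shaping terms telescope along any trajectory, so the sum of discounted bonuses depends only on the start and end potentials, which is why PBRS cannot change which policy is optimal.

```python
import numpy as np

GAMMA = 0.99                       # discount factor (assumed)
GOAL = np.array([0.5, 0.0, 0.2])   # hypothetical goal position

def potential(state):
    """Phi(s): negative distance to goal, so potential rises near the goal."""
    return -np.linalg.norm(state - GOAL)

def pbrs_bonus(s, s_next, gamma=GAMMA):
    """F(s, s') = gamma * Phi(s') - Phi(s), added to the environment reward."""
    return gamma * potential(s_next) - potential(s)
```

Summing `GAMMA**t * pbrs_bonus(...)` over a trajectory of length T collapses to `GAMMA**T * Phi(s_T) - Phi(s_0)`, the telescoping identity behind the Ng et al. (1999) policy-invariance proof.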

Dangers: Reward Hacking and Goodhart's Law

Reward hacking occurs when the agent finds a way to maximize the shaped reward without accomplishing the intended task. This is Goodhart's Law applied to robotics: "When a measure becomes a target, it ceases to be a good measure." Examples in manipulation:

  • A distance-based reward that measures gripper-to-object distance. The agent learns to hover the gripper next to the object without ever grasping it, because grasping risks increasing the distance momentarily.
  • A contact reward for peg insertion. The agent learns to press the peg against the side of the hole (maintaining contact) rather than inserting it, because insertion requires briefly breaking contact.
  • A velocity reward intended to encourage fast task completion. The agent learns to oscillate rapidly without making progress, accumulating velocity reward.
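The first failure mode above can be reproduced with toy numbers. Below, a naive shaped reward pays `-distance` every step plus a success bonus; all distances, step counts, and the bonus magnitude are illustrative assumptions. Because the grasp trajectory must tolerate a brief distance increase while the bonus is too small to compensate, hovering earns more total reward than actually completing the task.

```python
def step_reward(dist, done, success_bonus=0.5):
    """Naive shaped reward: -gripper-to-object distance, plus a bonus on success."""
    return -dist + (success_bonus if done else 0.0)

# Hover policy: park the gripper on the object for 10 steps, never grasp.
hover_return = sum(step_reward(0.0, False) for _ in range(10))

# Grasp policy: approach, momentarily retreat to regrasp, then succeed.
dists = [0.3, 0.2, 0.1, 0.15, 0.15, 0.0]  # toy gripper-to-object distances
grasp_return = sum(step_reward(d, i == len(dists) - 1)
                   for i, d in enumerate(dists))
```

With these numbers the hovering policy nets 0.0 while the successful grasp nets -0.4, so a reward-maximizing agent hovers. Raising the success bonus or switching to a progress-based (potential-style) term removes the perverse incentive.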

Mitigations include using potential-based shaping (provably safe), carefully testing shaped rewards in simulation before deployment, combining sparse success signals with mild shaping, and iteratively refining rewards based on observed agent behavior.

Comparison with Alternatives

Reward shaping vs. imitation learning: Imitation learning sidesteps reward design entirely by learning from demonstrations. Methods like behavior cloning and ACT require no reward function at all. This is a major practical advantage: designing good rewards is often harder than collecting 50 demonstrations. For teams without deep RL expertise, imitation learning is typically the faster path to working manipulation policies.

Reward shaping vs. hindsight experience replay (HER): HER retroactively relabels failed trajectories as successes for alternative goals, effectively creating dense reward from sparse feedback without manual reward engineering. It works well for goal-conditioned tasks but requires the ability to redefine goals post-hoc.
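HER's relabeling step can be sketched as follows, using the "final" goal-selection strategy and a 0/-1 sparse reward common in goal-conditioned benchmarks. The trajectory format (dicts with `achieved` and `goal` keys) and tolerance are assumptions for illustration, not any specific library's API.

```python
import numpy as np

def sparse_goal_reward(achieved, goal, tol=0.02):
    """Sparse goal-conditioned reward: 0 on success, -1 otherwise."""
    return 0.0 if np.linalg.norm(achieved - goal) < tol else -1.0

def her_relabel(trajectory):
    """Relabel a failed trajectory with the goal it actually achieved:
    under the 'final' strategy, the last achieved state becomes the goal."""
    new_goal = trajectory[-1]["achieved"]
    return [
        {**step,
         "goal": new_goal,
         "reward": sparse_goal_reward(step["achieved"], new_goal)}
        for step in trajectory
    ]
```

After relabeling, the final transition of every trajectory is a success, so the replay buffer always contains positive learning signal without any hand-designed shaping term.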

Reward shaping vs. intrinsic motivation: Curiosity-driven and exploration-bonus methods add rewards for visiting novel states, addressing the exploration problem without task-specific knowledge. They complement rather than replace task reward shaping.

Practical Requirements

Simulation: Reward shaping is almost always developed and tested in simulation first. Physics simulators like Isaac Sim, MuJoCo, or PyBullet provide the fast iteration cycles needed to test whether shaped rewards produce the intended behavior. A single reward design iteration (train policy, observe behavior, adjust reward) takes hours in simulation vs. days on real hardware.

State access: Shaped rewards typically require access to privileged state information (object positions, contact forces, joint angles) that may not be available from the robot's sensors alone. In simulation, this information is freely available. For real-world RL, shaped rewards may require additional sensing or state estimation.

Compute: RL with reward shaping still requires substantial compute: typically 1-10 GPU hours for manipulation tasks in simulation with PPO or SAC. Training on real hardware is 100-1000x slower due to real-time constraints.

Expertise: Effective reward shaping requires understanding of both the task physics and RL dynamics. Poorly shaped rewards can be worse than sparse rewards if they introduce local optima that trap the agent.

Key Papers

  • Ng, A.Y., Harada, D., & Russell, S. (1999). "Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping." ICML 1999. The foundational paper proving that potential-based shaping preserves optimal policies.
  • Andrychowicz, M. et al. (2017). "Hindsight Experience Replay." NeurIPS 2017. An alternative to reward shaping that relabels failed trajectories with achieved goals.
  • Ma, Y.J. et al. (2023). "Eureka: Human-Level Reward Design via Coding Large Language Models." ICLR 2024. Uses GPT-4 to generate and iteratively refine reward functions from task descriptions.
  • Rajeswaran, A. et al. (2018). "Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations." RSS 2018. Combines demonstration-based reward shaping with RL for dexterous hand manipulation.


Apply This at SVRC

Silicon Valley Robotics Center offers both RL and imitation learning pathways for manipulation policy training. Our RL environment service provides pre-configured Isaac Sim and MuJoCo setups with tested reward functions for common tasks. For teams that want to skip reward engineering entirely, our data services collect high-quality demonstrations for imitation learning approaches like ACT and Diffusion Policy.
