Privileged Reinforcement Learning for Quadrupedal Robots

Teacher-student PPO policy for terrain traversal using only proprioceptive sensing

Steven Hong, Shugo Kaneko, Byungjin Kim, Elijah Hodges, Nicholas Matton

University of Michigan — Ann Arbor, College of Engineering

Model-based locomotion controllers for quadrupedal robots are complex state machines that require extensive hand-engineering yet still fail on corner-case terrains. Reinforcement learning (RL) offers a compelling alternative, but it suffers from a sim-to-real gap and typically requires billions of simulation steps to train. Our approach uses privileged RL training with a teacher-student framework to overcome these challenges: the teacher policy has access to privileged information (terrain profiles, contact states) during training, while the student policy learns to replicate the teacher's behavior using only the joint encoders available on the real robot.

Trained Agent Demo

The trained student policy navigating varied terrain in the RaiSim physics simulator.

Methodology

A two-phase privileged learning approach that bridges the sim-to-real gap.

Phase 1
Teacher Policy

The teacher policy receives privileged information that would not be available on a real robot: ground-truth contact states, terrain height profiles, foot contact forces, and exact body orientation. With this rich observation space, the teacher learns an expert locomotion policy quickly and reliably using PPO. The teacher outputs actions a — motor torques for all 12 joints.
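As a concrete sketch, the teacher's observation can be assembled by concatenating proprioceptive features with the privileged ones listed above. The feature names and dimensions here are illustrative, not the project's exact layout:

```python
import numpy as np

def teacher_observation(joint_pos, joint_vel, body_orientation,
                        contact_states, height_samples, contact_forces):
    """Build the teacher's observation vector by stacking proprioceptive
    features with privileged simulator-only features."""
    return np.concatenate([
        joint_pos,          # 12 joint angles (available on the real robot)
        joint_vel,          # 12 joint velocities
        body_orientation,   # body attitude, e.g. gravity vector in body frame (3)
        contact_states,     # 4 binary foot-contact flags (privileged)
        height_samples,     # terrain height profile around the robot (privileged)
        contact_forces,     # per-foot contact force magnitudes (privileged)
    ])
```

The student's observation is the same vector with the last three (privileged) blocks removed.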

Phase 2
Student Policy

The student policy observes only joint encoder readings — information available on a physical robot. It is trained via behavior cloning to imitate the teacher's actions, learning to infer terrain and contact information implicitly from proprioceptive history. This produces a deployable policy that performs nearly as well as the teacher without access to privileged state.
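A minimal behavior-cloning setup, sketched in PyTorch: the student maps a history of encoder readings to 12 joint torques and regresses onto the frozen teacher's actions. The history length and network sizes are illustrative assumptions, not the project's exact architecture:

```python
import torch
import torch.nn as nn

class StudentPolicy(nn.Module):
    """Maps a flattened history of joint-encoder readings (position and
    velocity per joint) to 12 joint torques."""
    def __init__(self, history_len=30, n_joints=12, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(history_len * 2 * n_joints, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_joints),
        )

    def forward(self, encoder_history):
        return self.net(encoder_history)

def behavior_cloning_loss(student, encoder_history, teacher_actions):
    """Mean-squared imitation error against the frozen teacher's actions."""
    return nn.functional.mse_loss(student(encoder_history), teacher_actions)
```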

Simulation

Training loop powered by Proximal Policy Optimization in the RaiSim physics engine.

01
Agent Observes

The agent (policy = robot controller) reads state: robot attitude, leg positions, velocity, and (for the teacher) terrain and contact data.

02
Action Output

The policy network outputs control inputs — torques for all 12 joints of the quadruped.

03
RaiSim Steps

The RaiSim physics engine simulates the robot's interaction with the environment and returns the next state.

04
Reward & Update

The reward signal evaluates walking quality. PPO uses clipped surrogate loss with GAE to update the policy.
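The four steps above form one rollout-collection cycle. A generic transcription, where the environment interface is a simplified stand-in for the RaiSim gym wrapper rather than its actual API:

```python
def collect_rollout(env, policy, horizon):
    """Run the observe -> act -> step -> reward loop for `horizon` steps
    and return the transitions PPO will later use for its update."""
    obs = env.reset()
    trajectory = []
    for _ in range(horizon):
        action = policy(obs)                        # step 02: joint torques
        next_obs, reward, done = env.step(action)   # step 03: physics step
        trajectory.append((obs, action, reward, done))  # step 04: for PPO/GAE
        obs = env.reset() if done else next_obs     # step 01: observe again
    return trajectory
```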

Reward Function

A shaped reward balancing forward progress against energy and stability penalties.

J = 0.4 r_v − 0.3 r_b − (3 × 10⁻⁵) r_jv − (4 × 10⁻⁵) r_T
Forward Velocity
r_v = min(v_x, 2)

Rewards forward movement up to a velocity cap of 2 m/s, preventing the agent from learning unstable sprinting gaits.

Body Motion Penalty
r_b = 1.25 v_z² + 0.4 ω_roll² + 0.4 ω_pitch²

Penalizes vertical oscillation and body rotation, encouraging a smooth, stable gait with minimal energy wasted on bouncing or rocking.

Joint Velocity
r_jv = Σ_{i=1}^{12} v_{j,i}²

Penalizes excessive joint speeds across all 12 joints, promoting energy-efficient motion and smoother trajectories.

Joint Torque
r_T = Σ_{i=1}^{12} ‖τ_i‖²

Penalizes total torque magnitude, reducing energy consumption and preventing the agent from relying on brute-force motor commands.
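The four terms combine exactly as in the weighted sum above. A direct transcription, assuming scalar body velocities and 12-element joint arrays:

```python
import numpy as np

def reward(v_x, v_z, w_roll, w_pitch, joint_vels, joint_torques):
    """Shaped locomotion reward J with the coefficients from the text."""
    r_v = min(v_x, 2.0)                                    # capped forward velocity
    r_b = 1.25 * v_z**2 + 0.4 * w_roll**2 + 0.4 * w_pitch**2  # body motion penalty
    r_jv = np.sum(np.square(joint_vels))                   # joint velocity penalty
    r_T = np.sum(np.square(joint_torques))                 # joint torque penalty
    return 0.4 * r_v - 0.3 * r_b - 3e-5 * r_jv - 4e-5 * r_T
```

Note how the velocity cap kicks in: walking at 3 m/s earns no more forward-velocity reward than walking at 2 m/s, while the penalty terms keep growing with effort.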

Terrain Curriculum

Progressively harder terrains train the agent to generalize across diverse environments.

1
Flat Ground

The agent first learns basic locomotion on a flat surface, mastering balance and a stable forward gait.

2
Gentle Slopes

Mild inclines and declines teach the agent to adjust its gait for elevation changes and maintain balance on gradients.

3
Steep Slopes

Steeper terrain requires more aggressive torque modulation and body posture adjustments to avoid slipping or falling.

4
Hills & Rough Terrain

The final stage combines rolling hills with irregular surfaces, demanding the full range of locomotion skills.
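A curriculum like the four stages above needs a promotion rule. The text does not state the exact criterion used, so the reward threshold below is an illustrative assumption; one common choice is to advance once the agent's mean episode reward clears a bar:

```python
TERRAIN_STAGES = ["flat", "gentle_slopes", "steep_slopes", "hills_rough"]

def next_stage(stage_idx, mean_reward, threshold=0.5):
    """Advance to the next terrain stage once the agent performs well
    enough on the current one; stay put otherwise."""
    if mean_reward >= threshold and stage_idx < len(TERRAIN_STAGES) - 1:
        return stage_idx + 1
    return stage_idx
```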

Hyperparameters

PPO and optimization parameters used for training.

Reward discount factor γ 0.996
GAE parameter λ 0.95
Clip ratio 0.2
Value function coefficient c1 0.5
Entropy coefficient c2 0.01
Adam learning rate 5e-4
Learning epochs per PPO update 4
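The table above maps directly onto a training config. Key names here are illustrative, not tied to a specific library:

```python
PPO_CONFIG = {
    "gamma": 0.996,        # reward discount factor γ
    "lam": 0.95,           # GAE parameter λ
    "clip_ratio": 0.2,     # PPO clipping range
    "vf_coef": 0.5,        # value function coefficient c1
    "ent_coef": 0.01,      # entropy coefficient c2
    "learning_rate": 5e-4, # Adam step size
    "epochs": 4,           # learning epochs per PPO update
}
```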

Results

Key findings from training and ablation experiments.

Improved Value Estimation

PPO optimizes the weighted sum of the clipped advantage, L2 error of the value function estimator, and an entropy term. Our best setup significantly reduced the value function estimator L2 error compared to baseline, leading to more stable policy updates.
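That weighted objective can be transcribed directly (sign-flipped so it becomes a loss to minimize), assuming the probability ratios and GAE advantages are precomputed:

```python
import torch

def ppo_loss(ratio, adv, value_pred, value_target, entropy,
             clip=0.2, c1=0.5, c2=0.01):
    """Negative of PPO's objective: clipped surrogate advantage, minus
    the value estimator's L2 error (weighted by c1), plus an entropy
    bonus (weighted by c2)."""
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip, 1 + clip) * adv)
    value_err = (value_pred - value_target).pow(2)
    return -(surrogate.mean() - c1 * value_err.mean() + c2 * entropy.mean())
```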

Adaptive Learning Rate

We experimented with adaptive vs. fixed learning rates. The adaptive schedule achieved higher peak reward and more consistent convergence, especially on the more challenging terrain stages.
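The text does not specify the adaptive rule used, so the sketch below is one common choice rather than the project's exact schedule: shrink the learning rate when the policy update overshoots a KL-divergence target, and grow it when the update undershoots:

```python
def adapt_lr(lr, kl, target_kl=0.01, factor=1.5, lr_min=1e-5, lr_max=1e-2):
    """KL-adaptive learning-rate schedule: damp large policy updates,
    speed up overly timid ones."""
    if kl > 2.0 * target_kl:
        return max(lr / factor, lr_min)
    if kl < 0.5 * target_kl:
        return min(lr * factor, lr_max)
    return lr
```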

Entropy Regularization

The entropy term encourages exploration during training. Models trained with entropy regularization (c2 = 0.01) discovered more robust gaits and avoided early convergence to suboptimal locomotion patterns.

Future Work

There are several promising directions for extending this work. Exploring alternative RL algorithms like Soft Actor-Critic (SAC) could improve sample efficiency while maintaining a broad action search space. Additional reward terms — such as a foot clearance reward to ensure the robot lifts its feet high enough to step over rough terrain — would improve robustness on highly irregular surfaces. Finally, deploying the trained student policy on physical hardware would be the ultimate validation of the sim-to-real transfer approach.