Privileged Reinforcement Learning for Quadrupedal Robots
Teacher-student PPO policy for terrain traversal using only proprioceptive sensing
University of Michigan — Ann Arbor, College of Engineering
Model-based locomotion controllers for quadrupedal robots are complex state machines that require extensive
hand-tuning yet still fail when traversing corner-case terrains. Reinforcement learning (RL)
offers a compelling alternative, but learned policies suffer from a sim-to-real gap and typically require
billions of environment interactions to train. Our approach uses privileged RL training with a
teacher-student framework to overcome these challenges: the teacher policy has access to privileged
information (terrain profiles, contact states) during training, while the student policy learns to replicate
the teacher's behavior using only the joint encoders available on the real robot.
Trained Agent Demo
The trained student policy navigating varied terrain in the RaiSim physics simulator.
Methodology
A two-phase privileged learning approach that bridges the sim-to-real gap.
The teacher policy receives privileged information that would not be available on a real robot: ground-truth contact states, terrain height profiles, foot contact forces, and exact body orientation. With this rich observation space, the teacher learns an expert locomotion policy quickly and reliably using PPO. The teacher's actions are motor torques for all 12 joints.
The student policy observes only joint encoder readings — information available on a physical robot. It is trained via behavior cloning to imitate the teacher's actions, learning to infer terrain and contact information implicitly from proprioceptive history. This produces a deployable policy that performs nearly as well as the teacher without access to privileged state.
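A minimal sketch of the two phases. Linear maps stand in for the actual policy networks, and the dimensions (24 proprioceptive signals, 48 privileged signals, 12 torques, a 10-step history) and the least-squares fit are illustrative assumptions, not the trained architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 24 proprioceptive signals, 48 privileged signals
# (terrain heights, contacts), 12 joint torques, 10-step observation history.
PROPRIO, PRIV, ACT, HISTORY = 24, 48, 12, 10

# Frozen "teacher": maps the full (proprioceptive + privileged) observation
# to joint torques. In the real system this is a PPO-trained network.
W_teacher = rng.normal(scale=0.1, size=(PROPRIO + PRIV, ACT))

def teacher_action(proprio, priv):
    """Expert action from the privileged observation."""
    return np.concatenate([proprio, priv], axis=-1) @ W_teacher

def train_student(proprio_hist, teacher_actions):
    """Behavior cloning: regress student actions onto teacher actions.
    The student sees only a stacked history of joint-encoder readings,
    from which terrain/contact information must be inferred implicitly."""
    W_student, *_ = np.linalg.lstsq(proprio_hist, teacher_actions, rcond=None)
    return W_student
```

Because the student's only input is proprioceptive history, whatever privileged signal it needs must be recoverable from that history, which is exactly the implicit inference described above.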
Simulation
Training loop powered by Proximal Policy Optimization in the RaiSim physics engine.
The agent (the policy, acting as the robot controller) reads the state: robot attitude, leg positions, velocity, and (for the teacher) terrain and contact data.
The policy network outputs control inputs — torques for all 12 joints of the quadruped.
The RaiSim physics engine simulates the robot's interaction with the environment and returns the next state.
The reward signal evaluates walking quality. PPO updates the policy with a clipped surrogate loss and generalized advantage estimation (GAE).
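The advantage estimation and clipping at the heart of this update can be sketched as follows; the defaults (gamma = 0.99, lambda = 0.95, epsilon = 0.2) are conventional PPO choices, not necessarily the values used in our runs:

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.
    `values` carries one extra bootstrap entry: len(values) == len(rewards) + 1."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    return adv

def clipped_surrogate(ratio, adv, eps=0.2):
    """PPO clipped objective (to be maximized): the pessimistic minimum of the
    unclipped and clipped importance-weighted advantage."""
    return np.minimum(ratio * adv, np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv).mean()
```

The clipping keeps the importance ratio near 1, so a single batch of simulator experience cannot push the policy arbitrarily far from the one that collected it.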
Reward Function
A shaped reward balancing forward progress against energy and stability penalties.
Rewards forward movement up to a velocity cap of 2 m/s, preventing the agent from learning unstable sprinting gaits.
Penalizes vertical oscillation and body rotation, encouraging a smooth, stable gait with minimal energy wasted on bouncing or rocking.
Penalizes excessive joint speeds across all 12 joints, promoting energy-efficient motion and smoother trajectories.
Penalizes total torque magnitude, reducing energy consumption and preventing the agent from relying on brute-force motor commands.
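The four terms above combine into a single scalar reward. A sketch with illustrative weights and squared-penalty forms (the actual coefficients are not stated in this write-up):

```python
import numpy as np

def locomotion_reward(forward_vel, vert_vel, ang_vel, joint_vels, torques,
                      vel_cap=2.0, w_fwd=1.0, w_stab=0.5,
                      w_jvel=1e-3, w_torque=1e-4):
    """Shaped reward sketch: capped forward progress minus stability,
    joint-speed, and torque penalties. Weights are illustrative assumptions."""
    r_forward = w_fwd * min(forward_vel, vel_cap)            # capped at 2 m/s
    p_stability = w_stab * (vert_vel ** 2 + np.sum(np.square(ang_vel)))
    p_joint_speed = w_jvel * np.sum(np.square(joint_vels))   # all 12 joints
    p_torque = w_torque * np.sum(np.square(torques))
    return r_forward - p_stability - p_joint_speed - p_torque
```

With the cap in place, running faster than 2 m/s earns no extra reward, so the cheapest way to score well is a smooth, low-torque gait at the cap.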
Terrain Curriculum
Progressively harder terrains train the agent to generalize across diverse environments.
The agent first learns basic locomotion on a flat surface, mastering balance and a stable forward gait.
Mild inclines and declines teach the agent to adjust its gait for elevation changes and maintain balance on gradients.
Steeper terrain requires more aggressive torque modulation and body posture adjustments to avoid slipping or falling.
The final stage combines rolling hills with irregular surfaces, demanding the full range of locomotion skills.
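One simple way to drive such a curriculum is to promote the agent to the next terrain stage once its recent average episode reward clears a threshold. The stage names mirror the progression above; the threshold and window values are illustrative assumptions:

```python
STAGES = ["flat", "gentle_slopes", "steep_slopes", "hills_and_rough"]

class TerrainCurriculum:
    """Promote to the next stage when the rolling mean episode reward
    clears a threshold. Threshold/window values are illustrative."""
    def __init__(self, threshold=15.0, window=100):
        self.stage = 0
        self.threshold = threshold
        self.window = window
        self.history = []

    def report(self, episode_reward):
        """Record one episode's reward; return the terrain to train on next."""
        self.history.append(episode_reward)
        recent = self.history[-self.window:]
        if (len(recent) == self.window
                and sum(recent) / self.window >= self.threshold
                and self.stage < len(STAGES) - 1):
            self.stage += 1
            self.history.clear()   # re-accumulate stats on the harder terrain
        return STAGES[self.stage]
```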
Hyperparameters
PPO and optimization parameters used for training.
Results
Key findings from training and ablation experiments.
PPO optimizes a weighted sum of the clipped surrogate objective, the L2 error of the value-function estimator, and an entropy term. Our best setup significantly reduced the value-function L2 error compared to the baseline, leading to more stable policy updates.
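Written out, the total loss is a three-term weighted sum; c1 = 0.5 is a conventional default (an assumption here), while c2 = 0.01 matches the entropy coefficient used in our experiments:

```python
def ppo_loss(clip_obj, value_l2, entropy, c1=0.5, c2=0.01):
    """Total PPO loss to minimize: negative clipped surrogate objective,
    plus weighted value-function L2 error, minus a weighted entropy bonus."""
    return -clip_obj + c1 * value_l2 - c2 * entropy
```

Lowering the value-function L2 term yields better advantage estimates, which is why reducing it correlates with more stable policy updates.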
We experimented with adaptive vs. fixed learning rates. The adaptive schedule achieved higher peak reward and more consistent convergence, especially on the more challenging terrain stages.
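One common adaptive scheme (an assumption here; this write-up does not specify which schedule was used) rescales the learning rate from the measured KL divergence between consecutive policies, shrinking the step when updates move the policy too far and growing it when they are too conservative:

```python
def adapt_lr(lr, kl, kl_target=0.01, factor=1.5, lr_min=1e-5, lr_max=1e-2):
    """KL-adaptive learning-rate sketch. All constants are illustrative.
    `kl` is the mean KL divergence between the old and updated policy."""
    if kl > kl_target * 2.0:
        lr = max(lr / factor, lr_min)    # policy moved too far: slow down
    elif kl < kl_target / 2.0:
        lr = min(lr * factor, lr_max)    # update too timid: speed up
    return lr
```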
The entropy term encourages exploration during training. Models trained with entropy regularization (c2 = 0.01) discovered more robust gaits and avoided early convergence to suboptimal locomotion patterns.
There are several promising directions for extending this work. Exploring alternative RL algorithms like Soft Actor-Critic (SAC) could improve sample efficiency while maintaining a broad action search space. Additional reward terms — such as a foot clearance reward to ensure the robot lifts its feet high enough to step over rough terrain — would improve robustness on highly irregular surfaces. Finally, deploying the trained student policy on physical hardware would be the ultimate validation of the sim-to-real transfer approach.