BipedalWalker: PPO from Scratch
Training a walking agent in real-time, entirely in your browser
This demo trains a Proximal Policy Optimization (PPO) agent to walk across procedurally
generated terrain. The physics simulation runs via planck.js (Box2D), and the neural networks train
live using TensorFlow.js. Everything runs client-side — no server needed. The agent starts with random
actions and gradually learns to balance, take steps, and eventually walk forward. Tune hyperparameters
and watch how they affect learning.
Live Training
Watch the agent learn to walk. Initial progress (standing, not falling) appears within a few minutes. Use turbo mode to speed up training.
Hyperparameters
Adjust PPO hyperparameters. "Apply & Reset" starts fresh; "Apply (Keep Weights)" updates parameters without losing learned behavior.
Reward History
Episode rewards over time. The red line shows a 10-episode moving average.
PPO Pipeline
How Proximal Policy Optimization trains the walking agent.
Run the current policy for 2048 steps in the environment, storing observations, actions, rewards, and log-probabilities.
Use Generalized Advantage Estimation (GAE) with lambda=0.95 to compute how much better each action was than expected.
Run 10 epochs of mini-batch gradient descent with clipped surrogate loss, value loss, and entropy bonus to update actor and critic.
Repeat the collect-optimize cycle. The clipping mechanism keeps each update small, preventing the policy from drifting too far from the one that collected the data.
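The advantage-estimation step above can be sketched in plain JavaScript. This is a minimal illustration, not the demo's actual code: it assumes a rollout's rewards, value estimates, and done flags have already been collected, and uses gamma = 0.99 as a typical discount alongside the lambda = 0.95 mentioned above.

```javascript
// Sketch of Generalized Advantage Estimation (GAE) over one rollout.
// `values[t]` is the critic's estimate V(s_t); `lastValue` bootstraps
// the value of the state after the final step (ignored if that step
// ended the episode).
function computeGAE(rewards, values, dones, lastValue, gamma = 0.99, lam = 0.95) {
  const n = rewards.length;
  const advantages = new Array(n).fill(0);
  let gae = 0;
  for (let t = n - 1; t >= 0; t--) {
    const nextValue = t === n - 1 ? lastValue : values[t + 1];
    const nonTerminal = dones[t] ? 0 : 1;
    // TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    const delta = rewards[t] + gamma * nextValue * nonTerminal - values[t];
    // GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}
    gae = delta + gamma * lam * nonTerminal * gae;
    advantages[t] = gae;
  }
  // Value-function targets: R_t = A_t + V(s_t)
  const returns = advantages.map((a, i) => a + values[i]);
  return { advantages, returns };
}
```

Sweeping backward through the rollout lets each advantage reuse the one after it, so the whole computation is a single O(n) pass.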
Proximal Policy Optimization strikes a balance between sample efficiency and implementation simplicity. Unlike vanilla policy gradient methods that can diverge with large updates, PPO's clipped surrogate objective constrains how far the new policy can deviate from the old one. This makes training more stable without the complexity of trust region methods (TRPO). PPO has become a default choice for continuous control tasks, powering everything from robotic locomotion to game-playing agents, and it was a key algorithm in training large language models via RLHF.
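The clipped surrogate objective can be written out for a single sample. This is an illustrative sketch, not the demo's implementation: it assumes log-probabilities of the taken action under the old and new policies, an advantage estimate, and the common clip range epsilon = 0.2.

```javascript
// Per-sample PPO clipped surrogate loss. The probability ratio
// r = pi_new(a|s) / pi_old(a|s) is computed from log-probs for
// numerical stability. PPO maximizes min(r * A, clip(r, 1-eps, 1+eps) * A);
// as a loss to minimize, we negate it.
function clippedSurrogateLoss(newLogProb, oldLogProb, advantage, epsilon = 0.2) {
  const ratio = Math.exp(newLogProb - oldLogProb);
  const unclipped = ratio * advantage;
  const clipped = Math.min(Math.max(ratio, 1 - epsilon), 1 + epsilon) * advantage;
  return -Math.min(unclipped, clipped);
}
```

Taking the minimum means the objective only credits a ratio outside [1 - epsilon, 1 + epsilon] when that makes things worse, so the gradient vanishes once an update has moved the policy far enough on a given sample. This is the constraint that lets PPO skip TRPO's second-order trust-region machinery.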