BipedalWalker RL Demo

BipedalWalker: PPO from Scratch

Training a walking agent in real-time, entirely in your browser

This demo trains a Proximal Policy Optimization (PPO) agent to walk across procedurally generated terrain. The physics simulation runs via planck.js (Box2D), and the neural networks train live using TensorFlow.js. Everything runs client-side — no server needed. The agent starts with random actions and gradually learns to balance, take steps, and eventually walk forward. Tune hyperparameters and watch how they affect learning.

Live Training

Watch the agent learn to walk. Initial progress (standing, not falling) appears within a few minutes. Use turbo mode to speed up training.

[Live stats panel: Episodes, Best Reward, Avg Reward (10), Record Distance, Total Steps]

Hyperparameters

Adjust PPO hyperparameters. "Apply & Reset" starts fresh; "Apply (Keep Weights)" updates parameters without losing learned behavior.

[Hyperparameter controls; defaults shown: 3e-4, 0.990, 0.20, 0.010, 1.0, 1.4, 0.05]
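The sliders above map onto a standard PPO configuration. A minimal sketch of what such a config object might look like, assuming conventional field names (the 0.5 value-loss weight is a common default, not a value taken from this demo):

```javascript
// Illustrative PPO config; names are assumptions, values match the
// demo's stated defaults (2048-step rollouts, 10 epochs, GAE lambda 0.95).
const defaultHyperparams = {
  learningRate: 3e-4,  // Adam step size
  gamma: 0.99,         // reward discount factor
  gaeLambda: 0.95,     // GAE smoothing parameter
  clipEpsilon: 0.2,    // PPO clipping range
  entropyCoef: 0.01,   // entropy bonus weight
  valueCoef: 0.5,      // critic loss weight (common default, assumed)
  rolloutSteps: 2048,  // environment steps collected per update
  epochs: 10,          // optimization epochs per rollout
};
```

"Apply (Keep Weights)" can simply swap such an object in place, since none of these fields change the network architecture.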

Reward History

Episode rewards over time. The red line shows a 10-episode moving average.
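The moving average smooths out the high episode-to-episode variance typical of RL training. A minimal sketch of the computation behind such a line (function name is illustrative):

```javascript
// Trailing moving average over episode rewards. Early entries average
// over however many episodes exist so far, so the line starts immediately.
function movingAverage(rewards, window = 10) {
  return rewards.map((_, i) => {
    const start = Math.max(0, i - window + 1);
    const slice = rewards.slice(start, i + 1);
    return slice.reduce((a, b) => a + b, 0) / slice.length;
  });
}
```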

PPO Pipeline

How Proximal Policy Optimization trains the walking agent.

01
Collect Rollout

Run the current policy for 2048 steps in the environment, storing observations, actions, rewards, and log-probabilities.
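The collection loop can be sketched as follows. `env` and `policy` stand in for the demo's planck.js environment and TF.js actor; their interfaces here are assumptions chosen to keep the shape of the loop clear:

```javascript
// Collect a fixed-length rollout with the current policy, resetting the
// environment whenever an episode ends mid-rollout.
function collectRollout(env, policy, numSteps) {
  const buf = { obs: [], actions: [], rewards: [], logProbs: [], values: [], dones: [] };
  let obs = env.reset();
  for (let t = 0; t < numSteps; t++) {
    const { action, logProb, value } = policy.act(obs);   // sample from the actor
    const { nextObs, reward, done } = env.step(action);
    buf.obs.push(obs);
    buf.actions.push(action);
    buf.rewards.push(reward);
    buf.logProbs.push(logProb);  // needed later for the importance ratio
    buf.values.push(value);      // critic estimate, needed for GAE
    buf.dones.push(done);
    obs = done ? env.reset() : nextObs;
  }
  return buf;
}
```

Storing the log-probabilities at collection time is essential: the optimization step compares them against the updated policy's log-probabilities to form the clipped importance ratio.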

02
Compute Advantages

Use Generalized Advantage Estimation (GAE) with lambda=0.95 to compute how much better each action was than expected.
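GAE accumulates discounted TD errors backwards through the rollout. A self-contained sketch of the standard computation (function name is illustrative; `lastValue` bootstraps the final step if the rollout was cut off mid-episode):

```javascript
// Generalized Advantage Estimation over a rollout.
// delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)  (zeroed past episode ends)
// A_t     = delta_t + gamma * lambda * A_{t+1}
function computeGAE(rewards, values, dones, lastValue, gamma = 0.99, lambda = 0.95) {
  const n = rewards.length;
  const advantages = new Array(n).fill(0);
  let gae = 0;
  for (let t = n - 1; t >= 0; t--) {
    const nextValue = t === n - 1 ? lastValue : values[t + 1];
    const nonTerminal = dones[t] ? 0 : 1;  // stop bootstrapping at episode ends
    const delta = rewards[t] + gamma * nextValue * nonTerminal - values[t];
    gae = delta + gamma * lambda * nonTerminal * gae;
    advantages[t] = gae;
  }
  // Critic regression targets: advantage plus the old value baseline.
  const returns = advantages.map((a, i) => a + values[i]);
  return { advantages, returns };
}
```

With lambda = 0 this reduces to one-step TD errors (low variance, high bias); with lambda = 1 it becomes full Monte Carlo returns minus the baseline. The 0.95 default trades between the two.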

03
Optimize Policy

Run 10 epochs of mini-batch gradient descent with clipped surrogate loss, value loss, and entropy bonus to update actor and critic.
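The heart of that update is the clipped surrogate term. A per-sample sketch of the standard objective (the demo's actual TF.js implementation operates on tensors, but the math is the same):

```javascript
// PPO clipped surrogate loss for one sample. The importance ratio is
// exp(newLogProb - oldLogProb); clipping it to [1-eps, 1+eps] caps how
// much a single update can exploit one advantage estimate. Returned
// negated so that gradient *descent* maximizes the objective.
function ppoClipLoss(newLogProb, oldLogProb, advantage, clipEps = 0.2) {
  const ratio = Math.exp(newLogProb - oldLogProb);
  const clipped = Math.min(Math.max(ratio, 1 - clipEps), 1 + clipEps);
  return -Math.min(ratio * advantage, clipped * advantage);
}
```

Taking the minimum of the clipped and unclipped terms makes the bound pessimistic: the policy gains nothing from pushing the ratio beyond the clip range in the direction the advantage favors.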

04
Iterate

Repeat the collect-optimize cycle. The clipping mechanism ensures stable updates, preventing the policy from changing too drastically.
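The four stages above compose into one outer loop. A sketch, with the stage implementations passed in as stand-ins for the demo's internals:

```javascript
// Outer PPO training loop: each iteration collects a fresh rollout with
// the latest policy, scores it with GAE, and runs the clipped update.
function train(env, agent, iterations, stages) {
  for (let i = 0; i < iterations; i++) {
    const rollout = stages.collectRollout(env, agent, 2048);
    const { advantages, returns } = stages.computeGAE(rollout);
    stages.optimize(agent, rollout, advantages, returns); // 10 epochs inside
  }
}
```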

Why PPO?

Proximal Policy Optimization strikes a balance between sample efficiency and implementation simplicity. Unlike vanilla policy gradient methods that can diverge with large updates, PPO's clipped surrogate objective constrains how far the new policy can deviate from the old one. This makes training more stable without the complexity of trust region methods (TRPO). PPO has become a default choice for continuous control tasks, powering everything from robotic locomotion to game-playing agents, and it was a key algorithm in training large language models via RLHF.