AIMS Coursework
REINFORCE — CartPole
Late 2024 — Policy Gradient Method
Policy Gradient
REINFORCE
Monte Carlo
JAX/Haiku
About
Implementation of the REINFORCE algorithm — a Monte Carlo policy gradient method that optimizes the policy directly by reinforcing actions that lead to higher returns. Unlike DQN, which learns a value function and acts greedily with respect to it, REINFORCE learns a stochastic policy.
View on GitHub →
Method: REINFORCE Algorithm
Policy Gradient Objective
Maximize the expected return over trajectories sampled from the policy:
J(πθ) = Eτ~πθ[R(τ)],  where R(τ) = Σt γᵗ rt
Gradient Computation
Weight log-probabilities by returns:
∇θJ = E[Σt Gt ∇θ log πθ(at|st)]
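A minimal sketch of this gradient expressed as a loss in JAX (the name reinforce_loss, the apply_fn argument, and the array shapes are illustrative assumptions, not the coursework code):

```python
import jax
import jax.numpy as jnp

def reinforce_loss(params, apply_fn, observations, actions, returns):
    """Negative REINFORCE objective for one episode: -Σt Gt · log πθ(at|st)."""
    logits = apply_fn(params, observations)        # (T, num_actions)
    log_probs = jax.nn.log_softmax(logits)         # (T, num_actions)
    taken = jnp.take_along_axis(log_probs, actions[:, None], axis=1)[:, 0]  # (T,)
    return -jnp.sum(returns * taken)

# jax.grad over params gives the (negated) Monte Carlo policy-gradient estimate above
policy_grad = jax.grad(reinforce_loss)
```

Minimizing this loss by gradient descent therefore ascends the objective J(πθ).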
Neural Network Policy
- Input: 4-dim observation
- 2 hidden layers × 20 units
- Output: action logits → softmax
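One way to express this architecture in Haiku (layer sizes follow the list above; the use of hk.nets.MLP with default ReLU activations is an assumption):

```python
import haiku as hk
import jax
import jax.numpy as jnp

def policy_fn(obs):
    """CartPole policy: 4-dim observation -> 2 action logits."""
    return hk.nets.MLP([20, 20, 2])(obs)  # softmax/sampling applied by the caller

policy = hk.without_apply_rng(hk.transform(policy_fn))
params = policy.init(jax.random.PRNGKey(0), jnp.zeros((1, 4)))
logits = policy.apply(params, jnp.zeros((1, 4)))  # shape (1, 2)
```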
Training Details
- Episodes: 2,500
- Learning rate: 1e-3
- Discount γ: 0.99
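A sketch of the per-episode update under these settings (the choice of optax Adam and the helper names are assumptions; it reuses reinforce_loss, policy, and params from the sketches above):

```python
import jax
import jax.numpy as jnp
import optax

GAMMA = 0.99
optimizer = optax.adam(1e-3)
opt_state = optimizer.init(params)  # params from the Haiku sketch above

def discounted_returns(rewards, gamma=GAMMA):
    """Gt = rt + gamma * G(t+1), computed backwards over one finished episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return jnp.asarray(out[::-1])

def update(params, opt_state, observations, actions, returns):
    """One gradient step on the episode's REINFORCE loss."""
    grads = jax.grad(
        lambda p: reinforce_loss(p, policy.apply, observations, actions, returns)
    )(params)
    updates, opt_state = optimizer.update(grads, opt_state)
    return optax.apply_updates(params, updates), opt_state
```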
DQN vs REINFORCE
| Aspect | DQN | REINFORCE |
|---|---|---|
| Learns | Value function Q(s,a) | Policy π(a|s) directly |
| Policy type | Deterministic (argmax) | Stochastic (sampling) |
| Update | Every step (TD) | End of episode (MC) |
| Replay buffer | ✅ Yes | ❌ No (on-policy) |
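To make the policy-type row concrete, a hedged sketch of the two action-selection rules side by side (function names are illustrative, not from either implementation):

```python
import jax
import jax.numpy as jnp

def dqn_act(q_values):
    """DQN-style: deterministic, greedy over the learned Q-values."""
    return jnp.argmax(q_values)

def reinforce_act(rng, logits):
    """REINFORCE-style: stochastic, sampled from the softmax policy."""
    return jax.random.categorical(rng, logits)
```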
Results
✅ Solved! The agent reliably reaches the maximum episode return (500) after training
Performance Metrics:
- Initial: Episode returns close to 0
- Final: Episode returns reliably reaching 500
- Training: 2,500 episodes, 2 learning steps per episode
Key Takeaways
- ✅ Policy gradients can solve CartPole without learning Q-values
- ✅ Stochastic policy provides natural exploration
- ✅ Monte Carlo returns require full episodes (no bootstrapping)
- ✅ Simpler implementation than DQN (no replay buffer or target network)