🎯 AIMS Coursework

REINFORCE — CartPole

Late 2024 — Policy Gradient Method

Policy Gradient REINFORCE Monte Carlo JAX/Haiku

About

Implementation of the REINFORCE algorithm, a Monte Carlo policy gradient method that directly optimizes the policy by reinforcing actions that lead to higher returns. Unlike DQN, which learns a value function, REINFORCE learns a stochastic policy directly.

View on GitHub →

Method: REINFORCE Algorithm

Policy Gradient Objective

Maximize expected cumulative reward:

J(πθ) = Eτ~πθ[Ψ(τ)], where Ψ(τ) is the cumulative reward of trajectory τ

Gradient Computation

Weight log-probabilities by returns:

∇θJ = E[Σt Gt ∇θ log πθ(at|st)]
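In code, this estimator becomes a scalar loss whose gradient matches the expression above: the negative sum of the chosen actions' log-probabilities weighted by the returns-to-go Gt. The sketch below is a minimal, hypothetical illustration in JAX using a plain linear-softmax policy rather than the project's Haiku network; `reinforce_loss` and the parameter names are placeholders, not the project's code.

```python
import jax
import jax.numpy as jnp

def reinforce_loss(params, states, actions, returns):
    """Negative REINFORCE objective for one episode.

    states: (T, 4) observations, actions: (T,) int action indices,
    returns: (T,) returns-to-go G_t.
    """
    logits = states @ params["w"] + params["b"]      # (T, num_actions)
    log_probs = jax.nn.log_softmax(logits)           # log πθ(·|s_t)
    chosen = jnp.take_along_axis(log_probs, actions[:, None], axis=1)[:, 0]
    # Minimizing this loss ascends E[Σt Gt ∇θ log πθ(at|st)]
    return -jnp.sum(returns * chosen)

# Monte Carlo policy-gradient estimate for one sampled episode
grad_fn = jax.grad(reinforce_loss)
```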

Neural Network Policy

  • Input: 4-dim observation
  • 2 hidden layers × 20 units
  • Output: action logits → softmax (see the sketch after this list)
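As a concrete sketch of that architecture, the Haiku snippet below builds a 20-20 MLP over the 4-dimensional observation and outputs logits over CartPole's two discrete actions. The ReLU activation and the helper names (`policy_net`, `policy`) are assumptions for illustration, not taken from the project code.

```python
import haiku as hk
import jax
import jax.numpy as jnp

def policy_net(obs):
    """4-dim observation -> logits over CartPole's 2 discrete actions."""
    return hk.nets.MLP([20, 20, 2], activation=jax.nn.relu)(obs)

policy = hk.transform(policy_net)

rng = jax.random.PRNGKey(0)
init_key, sample_key = jax.random.split(rng)

params = policy.init(init_key, jnp.zeros(4))       # initialize with a dummy observation
logits = policy.apply(params, None, jnp.zeros(4))  # forward pass
action = jax.random.categorical(sample_key, logits)  # sample from the softmax policy
```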

Training Details

  • Episodes: 2,500
  • Learning rate: 1e-3
  • Discount γ: 0.99 (a per-episode update sketch follows below)
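Putting these hyperparameters together, each episode ends with computing discounted returns-to-go at γ = 0.99 and taking a gradient step at learning rate 1e-3. This is a hedged sketch: the optimizer choice (Adam via optax) and the helper `returns_to_go` are assumptions; the source states only the learning rate and discount.

```python
import jax.numpy as jnp
import optax

GAMMA = 0.99  # discount factor from the training details above

def returns_to_go(rewards, gamma=GAMMA):
    """G_t = r_t + gamma * r_{t+1} + ..., computed backwards over one episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return jnp.array(out[::-1])

# Optimizer choice is an assumption; the source only gives lr = 1e-3.
optimizer = optax.adam(1e-3)

# Schematic end-of-episode update, combining a loss/gradient function like
# `grad_fn` from the earlier sketch with the optax optimizer:
#   opt_state = optimizer.init(params)
#   grads = grad_fn(params, states, actions, returns_to_go(rewards))
#   updates, opt_state = optimizer.update(grads, opt_state)
#   params = optax.apply_updates(params, updates)
```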

DQN vs REINFORCE

Aspect        | DQN                    | REINFORCE
Learns        | Value function Q(s,a)  | Policy π(a|s) directly
Policy type   | Deterministic (argmax) | Stochastic (sampling)
Update        | Every step (TD)        | End of episode (MC)
Replay buffer | ✅ Yes                 | ❌ No (on-policy)

Results

✅ Solved! The agent reliably reaches the maximum episode return of 500 after training

Performance Metrics:

  • Initial: Episode returns close to 0
  • Final: Episode returns reliably reaching 500
  • Training: 2,500 episodes, 2 learning steps per episode

Key Takeaways

  • Policy gradients can solve CartPole without learning Q-values
  • Stochastic policy provides natural exploration
  • Monte Carlo returns require full episodes (no bootstrapping)
  • Simpler implementation than DQN (no replay buffer or target network)