🎯 AIMS Coursework

REINFORCE — CartPole

Late 2024 — Policy Gradient Method

Policy Gradient REINFORCE Monte Carlo JAX/Haiku

About

Implementation of the REINFORCE algorithm, a Monte Carlo policy gradient method that directly optimizes the policy by reinforcing actions that lead to higher returns. Unlike DQN, which learns a value function, REINFORCE learns a stochastic policy directly.

View on GitHub →

Method: REINFORCE Algorithm

Policy Gradient Objective

Maximize expected cumulative reward:

J(πθ) = Eτ~πθ[Ψ(τ)], where Ψ(τ) is the cumulative reward of trajectory τ

Gradient Computation

Weight log-probabilities by returns:

∇θJ = E[Σt Gt ∇θ log πθ(at|st)]
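In code, this estimator becomes a scalar loss whose gradient matches the expression above: the negative sum of the chosen actions' log-probabilities weighted by the returns-to-go Gt. The sketch below is a minimal, hypothetical illustration in JAX using a plain linear-softmax policy rather than the project's Haiku network; `reinforce_loss` and the parameter names are placeholders, not the project's code.

```python
import jax
import jax.numpy as jnp

def reinforce_loss(params, states, actions, returns):
    """Negative REINFORCE objective for one episode.

    states: (T, 4) observations, actions: (T,) int action indices,
    returns: (T,) returns-to-go G_t.
    """
    logits = states @ params["w"] + params["b"]      # (T, num_actions)
    log_probs = jax.nn.log_softmax(logits)           # log πθ(·|s_t)
    chosen = jnp.take_along_axis(log_probs, actions[:, None], axis=1)[:, 0]
    # Minimizing this loss ascends E[Σt Gt ∇θ log πθ(at|st)]
    return -jnp.sum(returns * chosen)

# Monte Carlo policy-gradient estimate for one sampled episode
grad_fn = jax.grad(reinforce_loss)
```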

Neural Network Policy

  • Input: 4-dim observation
  • 2 hidden layers × 20 units
  • Output: action logits → softmax (see the sketch after this list)
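As a concrete sketch of that architecture, the Haiku snippet below builds a 20-20 MLP over the 4-dimensional observation and outputs logits over CartPole's two discrete actions. The ReLU activation and the helper names (`policy_net`, `policy`) are assumptions for illustration, not taken from the project code.

```python
import haiku as hk
import jax
import jax.numpy as jnp

def policy_net(obs):
    """4-dim observation -> logits over CartPole's 2 discrete actions."""
    return hk.nets.MLP([20, 20, 2], activation=jax.nn.relu)(obs)

policy = hk.transform(policy_net)

rng = jax.random.PRNGKey(0)
init_key, sample_key = jax.random.split(rng)

params = policy.init(init_key, jnp.zeros(4))       # initialize with a dummy observation
logits = policy.apply(params, None, jnp.zeros(4))  # forward pass
action = jax.random.categorical(sample_key, logits)  # sample from the softmax policy
```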

Training Details

  • Episodes: 2,500
  • Learning rate: 1e-3
  • Discount γ: 0.99 (a per-episode update sketch follows below)
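Putting these hyperparameters together, each episode ends with computing discounted returns-to-go at γ = 0.99 and taking a gradient step at learning rate 1e-3. This is a hedged sketch: the optimizer choice (Adam via optax) and the helper `returns_to_go` are assumptions; the source states only the learning rate and discount.

```python
import jax.numpy as jnp
import optax

GAMMA = 0.99  # discount factor from the training details above

def returns_to_go(rewards, gamma=GAMMA):
    """G_t = r_t + gamma * r_{t+1} + ..., computed backwards over one episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return jnp.array(out[::-1])

# Optimizer choice is an assumption; the source only gives lr = 1e-3.
optimizer = optax.adam(1e-3)

# Schematic end-of-episode update, combining a loss/gradient function like
# `grad_fn` from the earlier sketch with the optax optimizer:
#   opt_state = optimizer.init(params)
#   grads = grad_fn(params, states, actions, returns_to_go(rewards))
#   updates, opt_state = optimizer.update(grads, opt_state)
#   params = optax.apply_updates(params, updates)
```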

DQN vs REINFORCE

Aspect        | DQN                    | REINFORCE
Learns        | Value function Q(s,a)  | Policy π(a|s) directly
Policy type   | Deterministic (argmax) | Stochastic (sampling)
Update        | Every step (TD)        | End of episode (MC)
Replay buffer | ✅ Yes                 | ❌ No (on-policy)

Results

✅ Solved! The agent reliably reaches the maximum episode return of 500 after training

Performance Metrics:

  • Initial: Episode returns close to 0
  • Final: Episode returns reliably reaching 500
  • Training: 2,500 episodes, 2 learning steps per episode

Key Takeaways

  • Policy gradients can solve CartPole without learning Q-values
  • Stochastic policy provides natural exploration
  • Monte Carlo returns require full episodes (no bootstrapping)
  • Simpler implementation than DQN (no replay buffer or target network)