🌙 AIMS Coursework

REINFORCE — LunarLander

Late 2024 — Policy Gradient for Complex Control

Tags: Policy Gradient · REINFORCE · LunarLander-v2 · 5000+ episodes

About

This project applies the REINFORCE policy gradient algorithm to the challenging LunarLander-v2 environment. The agent must learn to fire its main and side thrusters to land safely on the target pad, using only episodic Monte Carlo returns as its learning signal.


Environment: LunarLander-v2

  • State dims: 8
  • Actions: 4
  • Episodes: 5000+
  • Batch size: 128
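
A quick way to confirm these dimensions; a minimal sketch assuming Gymnasium with the Box2D extra installed (the original coursework may have used the older gym package, and newer Gymnasium releases rename the environment to LunarLander-v3):

```python
import gymnasium as gym

# Requires the Box2D extra: pip install "gymnasium[box2d]"
env = gym.make("LunarLander-v2")

# 8 state dims: x, y, vx, vy, angle, angular velocity, left/right leg contact
print(env.observation_space.shape)  # (8,)

# 4 discrete actions: no-op, fire left engine, fire main engine, fire right engine
print(env.action_space.n)  # 4
```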

Implementation Details

Key Components

  • Policy Network: fully connected network outputting action logits
  • Returns Calculation: discounted Monte Carlo returns computed per episode
  • Loss: negative log-probability of the chosen actions, weighted by the returns
  • Optimizer: Adam, with learning-rate adjustments during training (see the sketch below for how these pieces fit together)

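A minimal PyTorch sketch of one REINFORCE update built from these components; the layer sizes, discount factor, learning rate, and return normalization are illustrative assumptions rather than the exact coursework settings:

```python
import torch
import torch.nn as nn
import gymnasium as gym

# Fully connected policy: 8 state dims -> action logits for the 4 discrete actions
# (hidden size, learning rate, and gamma below are assumed values)
policy = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

env = gym.make("LunarLander-v2")

def run_episode():
    """Roll out one episode, keeping log-probabilities of taken actions and the rewards."""
    log_probs, rewards = [], []
    obs, _ = env.reset()
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated
    return log_probs, rewards

def discounted_returns(rewards):
    """Monte Carlo returns G_t = r_t + gamma * G_{t+1}, computed backwards over the episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

log_probs, rewards = run_episode()
returns = torch.tensor(discounted_returns(rewards))
returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # common variance-reduction trick

# REINFORCE loss: log-probability of each action weighted by the return from that step onward
loss = -(torch.stack(log_probs) * returns).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```
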
Challenge: LunarLander is considerably harder than CartPole for REINFORCE because of its sparser rewards and longer episodes; it took 5000+ training episodes here versus roughly 2500 for CartPole.

Results

[Figure: REINFORCE LunarLander training curve, showing gradual improvement as the agent learns landing behavior]

Training Progress:

  • Initial: highly negative rewards as the lander crashes
  • Mid-training: gradual improvement as the agent explores
  • Final: moderately stable performance with successful landings

REINFORCE vs DQN on LunarLander

  • DQN: faster convergence (~700 episodes) and higher final return (250+)
  • REINFORCE: slower (5000+ episodes) and higher variance, but a simpler implementation
  • Why: REINFORCE weights each update by the full-episode Monte Carlo return (high variance), while DQN uses one-step TD learning with a replay buffer (lower variance); see the comparison below

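The variance gap shows up in what each method estimates; these are the standard textbook forms rather than equations lifted from the coursework report:

```latex
% REINFORCE: Monte Carlo policy gradient, weighted by the full return from step t onward
\nabla_\theta J(\theta) \;\approx\; \sum_{t=0}^{T-1} G_t \,\nabla_\theta \log \pi_\theta(a_t \mid s_t),
\qquad G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t} r_k

% DQN: one-step TD target, bootstrapped from a (target) Q-network instead of the full return
y_t = r_t + \gamma \,\max_{a'} Q_{\theta^-}(s_{t+1}, a')
```

Because G_t sums rewards over the whole remaining episode, its variance grows with episode length, which is exactly what makes LunarLander's long episodes painful for REINFORCE; the TD target bootstraps after a single step.
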
Key Takeaways

  • REINFORCE works, but needs many more training episodes than DQN
  • High variance from Monte Carlo returns slows learning
  • Being on-policy, REINFORCE cannot benefit from experience replay
  • Future improvement: add a learned baseline (moving toward A2C) to reduce variance, as sketched below
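
As a pointer for that future work, the baseline only changes the weighting term in the gradient; again this is the standard textbook form, not an equation from this project:

```latex
% Baseline-subtracted policy gradient: replace G_t with an advantage estimate
\nabla_\theta J(\theta) \;\approx\; \sum_{t=0}^{T-1} \bigl(G_t - V_\phi(s_t)\bigr)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)
```

Subtracting a learned state-value baseline V_\phi leaves the gradient unbiased but can substantially reduce its variance, which is the step from REINFORCE toward A2C.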