🍇 Research

VinePPO / CAL-GRPO

2025 — Fine-Grained Credit Assignment for RL Training

RLHF · Credit Assignment · JAX/Tunix · TPU

About

Implementation of Fine-Grained Credit Assignment (CAL) for RL training on Google's Tunix framework. Instead of coarse sequence-level rewards, CAL applies negative rewards surgically to only the tokens that caused errors — reducing variance and achieving faster, more stable training.

View on GitHub →

The Core Problem

Standard RLHF (PPO, GRPO)

  • Entire response gets a single "good" or "bad" score
  • A mostly-correct answer with one error is penalized uniformly
  • Model can't learn which specific tokens caused the error
  • High variance, training instability

CAL Approach

  • Token-level credit assignment using LLM oracles (GPT-4, Gemini)
  • Only error-causing tokens receive negative feedback (contrasted with the sequence-level case in the sketch below)
  • Low variance — precise learning signal
  • Stable KL divergence (train_kl < 0.01 vs. > 100 in standard PPO)
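
To make the contrast concrete, here is a minimal illustrative sketch (toy shapes and reward values, not taken from the repository) of how the same 10-token response is scored under a sequence-level scheme versus CAL's token-level scheme:

import jax.numpy as jnp

# Toy response of 10 tokens in which only positions 6-8 contain the error.
seq_len = 10
error_positions = jnp.array([6, 7, 8])

# Sequence-level reward (standard PPO/GRPO): every token inherits the same
# scalar score, so correct tokens are penalized along with the error.
seq_level = jnp.full((seq_len,), -1.0)

# Token-level reward (CAL): only the error-causing tokens are penalized.
tok_level = jnp.zeros((seq_len,)).at[error_positions].set(-1.0)

print(seq_level)  # [-1. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
print(tok_level)  # [ 0.  0.  0.  0.  0.  0. -1. -1. -1.  0.]

Because correct tokens contribute no penalty at all rather than a noisy shared one, the gradient signal has lower variance, which is the mechanism behind the stability claims above.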

The CAL Pipeline

  1. Generation: the model generates candidate responses to the training prompts.
  2. Error Detection: the CAL oracle (GPT-4 or Gemini) identifies where errors occur in each response.
  3. Token Mapping: error text spans are mapped to specific token positions.
  4. Sparse Rewards: only the error-causing tokens receive negative rewards.
  5. Advantage Calculation: fine-grained advantages are computed for GRPO policy updates (see the sketch after this list).
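
A minimal sketch of steps 3 to 5, assuming the oracle reports errors as character spans and that character offsets are available for each generated token (e.g. from a tokenizer offset mapping). The function names, shapes, and the exact way the sparse penalty is combined with GRPO's group-normalized advantage are illustrative assumptions, not the repository's API:

import jax.numpy as jnp

def error_spans_to_mask(error_spans, token_offsets):
    # Step 3: flag every token whose character range overlaps an
    # oracle-reported error span (half-open interval overlap test).
    mask = []
    for tok_start, tok_end in token_offsets:
        hit = any(tok_start < e_end and e_start < tok_end
                  for e_start, e_end in error_spans)
        mask.append(1.0 if hit else 0.0)
    return jnp.array(mask)

def sparse_token_rewards(error_mask, penalty=-1.0):
    # Step 4: negative reward only at flagged tokens, zero elsewhere.
    return penalty * error_mask

def grpo_token_advantages(seq_rewards, token_rewards):
    # Step 5 (one plausible wiring, assumed for illustration):
    # group-normalize the scalar outcome rewards as in GRPO, broadcast that
    # advantage to every token, then add the sparse per-token penalties so
    # the extra negative signal lands only on the error tokens.
    #   seq_rewards:   (G,)   one outcome reward per response in the group
    #   token_rewards: (G, T) sparse per-token rewards from step 4
    seq_adv = (seq_rewards - seq_rewards.mean()) / (seq_rewards.std() + 1e-6)
    return seq_adv[:, None] + token_rewards

Under this wiring a fully correct response is treated exactly as in standard GRPO, while an incorrect one concentrates its additional negative signal on the tokens the oracle flagged.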

Expected Results (GSM8K)

  • Baseline: 38.5%
  • CAL-GRPO: 42.3%
  • Improvement: +3.8 points

# Run experiments

./run_experiments.sh 100 4    # Proof of concept
python compare_results.py     # Compare baseline vs CAL

Key Takeaways

  • Token-level credit assignment beats sequence-level rewards
  • Lower variance — only incorrect tokens receive feedback
  • Stable training — KL divergence stays controlled
  • JAX/Tunix — native TPU support
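
The stable-KL claim above (train_kl < 0.01) refers to the divergence between the updated policy and the frozen reference model. A minimal sketch of how such a per-token KL metric is commonly estimated in GRPO-style training, using the k3 estimator; the function name and array shapes are illustrative assumptions, not the repository's API:

import jax.numpy as jnp

def mean_token_kl(policy_logprobs, ref_logprobs, response_mask):
    # k3 estimator of KL(policy || reference), averaged over response tokens.
    #   policy_logprobs, ref_logprobs: (B, T) log-probs of the sampled tokens
    #   under the current policy and the frozen reference model.
    #   response_mask: (B, T) 1.0 on response tokens, 0.0 on prompt/padding.
    log_ratio = ref_logprobs - policy_logprobs
    kl = jnp.exp(log_ratio) - log_ratio - 1.0   # non-negative per-token estimate
    return (kl * response_mask).sum() / jnp.maximum(response_mask.sum(), 1.0)

Keeping this quantity small over training indicates the policy is not drifting far from the reference, which is the stability signal the takeaway refers to.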