🍇
Research
VinePPO / CAL-GRPO
2025 — Fine-Grained Credit Assignment for RL Training
RLHF
Credit Assignment
JAX/Tunix
TPU
About
Implementation of Fine-Grained Credit Assignment (CAL) for RL training on Google's Tunix framework. Instead of coarse sequence-level rewards, CAL applies negative rewards surgically to only the tokens that caused errors — reducing variance and achieving faster, more stable training.
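To make the contrast concrete, here is a minimal sketch in jax.numpy of the difference between spreading one sequence score over every token and keeping a sparse per-token reward vector. The token count and error positions are made up for illustration:

```python
import jax.numpy as jnp

num_tokens = 8
score = -1.0  # the oracle judged the answer incorrect

# Sequence-level baseline: one scalar is spread over every token of the response.
dense_rewards = jnp.full(num_tokens, score)

# CAL-style: a sparse vector, negative only at the error-causing tokens.
error_positions = jnp.array([5, 6])  # hypothetical positions flagged by the oracle
sparse_rewards = jnp.zeros(num_tokens).at[error_positions].set(score)
```

With the sparse vector, tokens outside the flagged span contribute no reward signal at all.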
The Core Problem
Standard RLHF (PPO, GRPO)
- Entire response gets a single "good" or "bad" score
- A mostly-correct answer with one error is penalized uniformly
- Model can't learn which specific tokens caused the error
- High variance, training instability
CAL Approach
- Token-level credit assignment using LLM oracles (GPT-4, Gemini)
- Only error-causing tokens receive negative feedback
- Low variance — precise learning signal
- Stable KL divergence (train_kl < 0.01 vs > 100 in standard PPO)
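The train_kl figures quoted above are a monitoring statistic for how far the policy has drifted from its reference model. Below is a rough sketch of one common way to compute such a metric; the function name, shapes, and the simple log-ratio estimator are assumptions, not Tunix's actual API:

```python
import jax.numpy as jnp

def mean_token_kl(policy_logprobs, ref_logprobs, token_mask):
    """Rough estimate of per-token KL(policy || reference) over sampled tokens.

    policy_logprobs, ref_logprobs: [batch, seq] log-probs of the sampled tokens.
    token_mask: [batch, seq], 1.0 for response tokens, 0.0 for prompt/padding.
    Uses the simple log-ratio estimator, whose expectation is the true KL.
    """
    log_ratio = policy_logprobs - ref_logprobs
    return (log_ratio * token_mask).sum() / jnp.maximum(token_mask.sum(), 1.0)
```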
The CAL Pipeline
1. Generation: the model generates responses to prompts.
2. Error Detection: the CAL oracle (GPT-4 or Gemini) identifies where errors occur.
3. Token Mapping: error text spans are mapped to specific token positions (sketched after this list).
4. Sparse Rewards: only error-causing tokens receive negative rewards.
5. Advantage Calculation: fine-grained advantages feed GRPO policy updates (sketched after this list).
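Steps 3 and 4 turn an oracle-reported error span into a sparse per-token reward vector. A self-contained sketch, assuming the error span has already been located as character offsets in the decoded response and that character offsets are available for each token; the helper names and toy values are illustrative, not the repo's actual interfaces:

```python
import jax.numpy as jnp

def span_to_token_indices(token_offsets, span_start, span_end):
    """Indices of tokens whose character range overlaps [span_start, span_end)."""
    return [i for i, (start, end) in enumerate(token_offsets)
            if start < span_end and end > span_start]

def sparse_token_rewards(num_tokens, error_token_indices, penalty=-1.0):
    """Per-token reward vector: zero everywhere except oracle-flagged error tokens."""
    rewards = jnp.zeros(num_tokens)
    if error_token_indices:
        rewards = rewards.at[jnp.array(error_token_indices)].set(penalty)
    return rewards

# Toy example: three tokens with character offsets; the oracle flags chars 7-9.
offsets = [(0, 3), (4, 6), (7, 9)]
error_tokens = span_to_token_indices(offsets, span_start=7, span_end=9)  # -> [2]
rewards = sparse_token_rewards(num_tokens=3, error_token_indices=error_tokens)
```

In practice the error text returned by the oracle would first be located in the decoded response (for example by substring search) to obtain span_start and span_end.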
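For step 5, one plausible way to combine the sparse penalties with GRPO is to keep the usual group-normalized sequence reward as a per-response baseline and override the advantage only at oracle-flagged error tokens. This is a hedged sketch of that idea, not necessarily the exact CAL-GRPO formulation in the repo:

```python
import jax.numpy as jnp

def cal_grpo_advantages(seq_rewards, error_mask, error_penalty=-1.0, eps=1e-6):
    """Token-level advantages for one group of sampled responses.

    seq_rewards: [G] scalar correctness reward per response in the group.
    error_mask:  [G, T] with 1.0 at oracle-flagged error tokens, else 0.0.
    """
    # Standard GRPO baseline: normalize sequence rewards within the group.
    baseline = (seq_rewards - seq_rewards.mean()) / (seq_rewards.std() + eps)
    adv = jnp.broadcast_to(baseline[:, None], error_mask.shape)
    # CAL idea: push down only the tokens the oracle identified as error-causing.
    return jnp.where(error_mask > 0, error_penalty, adv)
```

The resulting [G, T] advantages then weight the per-token policy-gradient loss in the same place that broadcast sequence-level advantages would in standard GRPO.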
Expected Results (GSM8K)
- Baseline: 38.5%
- CAL-GRPO: 42.3%
- Improvement: +3.8 percentage points
# Run experiments
./run_experiments.sh 100 4 # Proof of concept
python compare_results.py # Compare baseline vs CAL
Key Takeaways
- ✅ Token-level credit assignment beats sequence-level rewards
- ✅ Lower variance — only incorrect tokens receive feedback
- ✅ Stable training — KL divergence stays controlled
- ✅ JAX/Tunix — native TPU support with optimized performance