🍇 Research

VinePPO / CAL-GRPO

2025 — Fine-Grained Credit Assignment for RL Training

RLHF · Credit Assignment · JAX/Tunix · TPU

About

Implementation of Fine-Grained Credit Assignment (CAL) for RL training on Google's Tunix framework. Instead of coarse sequence-level rewards, CAL applies negative rewards surgically to only the tokens that caused errors — reducing variance and achieving faster, more stable training.

View on GitHub →

The Core Problem

Standard RLHF (PPO, GRPO)

  • Entire response gets a single "good" or "bad" score
  • A mostly-correct answer with one error is penalized uniformly
  • Model can't learn which specific tokens caused the error
  • High variance, training instability

CAL Approach

  • Token-level credit assignment using LLM oracles (GPT-4, Gemini)
  • Only error-causing tokens receive negative feedback (contrasted with the sequence-level case in the sketch below)
  • Low variance — precise learning signal
  • Stable KL divergence (train_kl < 0.01 vs. > 100 in standard PPO)
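
To make the contrast concrete, here is a minimal illustrative sketch (toy shapes and reward values, not taken from the repository) of how the same 10-token response is scored under a sequence-level scheme versus CAL's token-level scheme:

import jax.numpy as jnp

# Toy response of 10 tokens in which only positions 6-8 contain the error.
seq_len = 10
error_positions = jnp.array([6, 7, 8])

# Sequence-level reward (standard PPO/GRPO): every token inherits the same
# scalar score, so correct tokens are penalized along with the error.
seq_level = jnp.full((seq_len,), -1.0)

# Token-level reward (CAL): only the error-causing tokens are penalized.
tok_level = jnp.zeros((seq_len,)).at[error_positions].set(-1.0)

print(seq_level)  # [-1. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
print(tok_level)  # [ 0.  0.  0.  0.  0.  0. -1. -1. -1.  0.]

Because correct tokens contribute no penalty at all rather than a noisy shared one, the gradient signal has lower variance, which is the mechanism behind the stability claims above.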

The CAL Pipeline

  1. Generation: the model generates candidate responses to the training prompts.
  2. Error Detection: the CAL oracle (GPT-4 or Gemini) identifies where errors occur in each response.
  3. Token Mapping: error text spans are mapped to specific token positions.
  4. Sparse Rewards: only the error-causing tokens receive negative rewards.
  5. Advantage Calculation: fine-grained advantages are computed for GRPO policy updates (see the sketch after this list).
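
A minimal sketch of steps 3 to 5, assuming the oracle reports errors as character spans and that character offsets are available for each generated token (e.g. from a tokenizer offset mapping). The function names, shapes, and the exact way the sparse penalty is combined with GRPO's group-normalized advantage are illustrative assumptions, not the repository's API:

import jax.numpy as jnp

def error_spans_to_mask(error_spans, token_offsets):
    # Step 3: flag every token whose character range overlaps an
    # oracle-reported error span (half-open interval overlap test).
    mask = []
    for tok_start, tok_end in token_offsets:
        hit = any(tok_start < e_end and e_start < tok_end
                  for e_start, e_end in error_spans)
        mask.append(1.0 if hit else 0.0)
    return jnp.array(mask)

def sparse_token_rewards(error_mask, penalty=-1.0):
    # Step 4: negative reward only at flagged tokens, zero elsewhere.
    return penalty * error_mask

def grpo_token_advantages(seq_rewards, token_rewards):
    # Step 5 (one plausible wiring, assumed for illustration):
    # group-normalize the scalar outcome rewards as in GRPO, broadcast that
    # advantage to every token, then add the sparse per-token penalties so
    # the extra negative signal lands only on the error tokens.
    #   seq_rewards:   (G,)   one outcome reward per response in the group
    #   token_rewards: (G, T) sparse per-token rewards from step 4
    seq_adv = (seq_rewards - seq_rewards.mean()) / (seq_rewards.std() + 1e-6)
    return seq_adv[:, None] + token_rewards

Under this wiring a fully correct response is treated exactly as in standard GRPO, while an incorrect one concentrates its additional negative signal on the tokens the oracle flagged.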

Expected Results (GSM8K)

  • Baseline: 38.5%
  • CAL-GRPO: 42.3%
  • Improvement: +3.8 points

# Run experiments

./run_experiments.sh 100 4    # Proof of concept
python compare_results.py     # Compare baseline vs CAL

Key Takeaways

  • Token-level credit assignment beats sequence-level rewards
  • Lower variance — only incorrect tokens receive feedback
  • Stable training — KL divergence stays controlled
  • JAX/Tunix — native TPU support
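
The stable-KL claim above (train_kl < 0.01) refers to the divergence between the updated policy and the frozen reference model. A minimal sketch of how such a per-token KL metric is commonly estimated in GRPO-style training, using the k3 estimator; the function name and array shapes are illustrative assumptions, not the repository's API:

import jax.numpy as jnp

def mean_token_kl(policy_logprobs, ref_logprobs, response_mask):
    # k3 estimator of KL(policy || reference), averaged over response tokens.
    #   policy_logprobs, ref_logprobs: (B, T) log-probs of the sampled tokens
    #   under the current policy and the frozen reference model.
    #   response_mask: (B, T) 1.0 on response tokens, 0.0 on prompt/padding.
    log_ratio = ref_logprobs - policy_logprobs
    kl = jnp.exp(log_ratio) - log_ratio - 1.0   # non-negative per-token estimate
    return (kl * response_mask).sum() / jnp.maximum(response_mask.sum(), 1.0)

Keeping this quantity small over training indicates the policy is not drifting far from the reference, which is the stability signal the takeaway refers to.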