Reading time: 15 minutes • Implementation time: 2-3 days • Difficulty: Advanced
Overview
GRPO (Group Relative Policy Optimization) is the core algorithm for training ATLAS teacher models. This guide walks through the complete training pipeline from SFT warmup to full RL optimization.
Prerequisites
- 4-8 H100 or A100 GPUs (40GB+ VRAM each)
- CUDA 11.8+
- Python 3.8+
- 100GB+ disk space for checkpoints
- Weights & Biases account (optional but recommended)
Training Pipeline
Step 1: Environment Setup
Install dependencies and configure the environment.
Flash Attention 2 significantly improves training speed and memory efficiency. Install it if your GPU supports it (Ampere or newer).
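Once the packages are installed, a quick sanity check helps confirm the prerequisites above are met. This is a hedged sketch that assumes PyTorch is installed and treats flash-attn as optional:

```python
# Hedged sanity check: verify GPU count, VRAM, and optional Flash Attention 2.
import torch

assert torch.cuda.is_available(), "CUDA not available"
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU {i}: {props.name}, {vram_gb:.0f} GB")
    assert vram_gb >= 40, "GRPO training expects 40GB+ VRAM per GPU"

try:
    import flash_attn  # noqa: F401
    print("flash-attn available")
except ImportError:
    print("flash-attn not installed; attention falls back to the default kernels")
```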
Step 2: SFT Warmup
Start with supervised fine-tuning to establish base capabilities; a sketch follows the parameter list below. Key parameters:
- num_train_epochs: 1-2 epochs is typically sufficient
- learning_rate: start with 2e-5 and adjust based on the loss curve
- gradient_accumulation_steps: increase for a larger effective batch size
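As a hedged sketch of the warmup stage, the example below uses TRL's SFTTrainer; the dataset and base model names are placeholders and the project's actual SFT script may differ:

```python
# Illustrative SFT warmup with TRL; dataset and model identifiers are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("your-org/your-sft-dataset", split="train")  # placeholder

config = SFTConfig(
    output_dir="checkpoints/sft-warmup",
    num_train_epochs=2,             # 1-2 epochs is typically sufficient
    learning_rate=2e-5,             # starting point; adjust based on the loss curve
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # raise for a larger effective batch size
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder base model
    args=config,
    train_dataset=dataset,
)
trainer.train()
```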
SFT is critical for GRPO success. Ensure loss converges before proceeding.
Step 3: Launch vLLM Server
Start the inference server for distributed generation. Key server configuration options (model path, tensor parallelism, GPU memory utilization, port) are shown in the sketch below.
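A hedged launch sketch using vLLM's OpenAI-compatible server; the entry point and flags that the project's own scripts use may differ, and the checkpoint path is a placeholder:

```python
# Illustrative vLLM server launch on 4 dedicated inference GPUs.
import os
import subprocess

env = dict(os.environ, CUDA_VISIBLE_DEVICES="4,5,6,7")  # reserve GPUs 4-7 for generation

subprocess.run(
    [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "checkpoints/sft-warmup",    # serve the SFT checkpoint
        "--tensor-parallel-size", "4",          # shard the model across the 4 GPUs
        "--gpu-memory-utilization", "0.9",      # fraction of VRAM vLLM may claim
        "--port", "8000",
    ],
    env=env,
    check=True,
)
```

Keep this process running for the duration of GRPO training.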
Step 4: Run GRPO Training
Execute the main RL training with the SFT checkpoint; a launch sketch follows the list below. The script automatically:
- Distributes training across 4 GPUs
- Uses 4 GPUs for vLLM generation
- Manages distributed communication
- Handles checkpointing
The split here is 4 GPUs for training and 4 GPUs for vLLM inference. Adjust based on your hardware.
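As a hedged illustration of that GPU split, the wrapper below pins 4 GPUs to the trainer and assumes the vLLM server from the previous step already occupies the other 4; train_grpo.py and its flags are placeholders, not the project's actual entry point:

```python
# Illustrative launcher for the 4-training / 4-inference GPU split.
import os
import subprocess

train_env = dict(os.environ, CUDA_VISIBLE_DEVICES="0,1,2,3")  # GPUs for GRPO updates

subprocess.run(
    [
        "accelerate", "launch",
        "--num_processes", "4",                            # one process per training GPU
        "train_grpo.py",                                   # placeholder training script
        "--model_name_or_path", "checkpoints/sft-warmup",
        "--output_dir", "checkpoints/grpo",
    ],
    env=train_env,
    check=True,
)
```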
Step 5: Monitor Training
Track key metrics during training: mean reward, reward standard deviation, KL divergence from the reference policy, completion length, and loss.
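A hedged real-time monitoring sketch that tails the Trainer's trainer_state.json from the latest checkpoint; the exact metric keys depend on the TRL version, and the output directory is a placeholder:

```python
# Illustrative polling of logged metrics from the most recent checkpoint.
import json
import time
from pathlib import Path

OUTPUT_DIR = Path("checkpoints/grpo")  # placeholder output directory

while True:
    states = sorted(
        OUTPUT_DIR.glob("checkpoint-*/trainer_state.json"),
        key=lambda p: int(p.parent.name.split("-")[-1]),
    )
    if states:
        history = json.loads(states[-1].read_text()).get("log_history", [])
        if history:
            latest = history[-1]
            keys = ("step", "loss", "reward", "kl", "completion_length")  # names vary by version
            print({k: latest.get(k) for k in keys})
    time.sleep(60)
```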
Step 6: Evaluate Model
Test the trained model on validation data; an evaluation sketch follows the list below. Expected metrics:
- Improvement rate: >15%
- Non-degradation: >95%
- Token efficiency: <250 tokens average
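A hedged sketch of how those three numbers could be computed from paired validation records; how the baseline score, with-teaching score, and teacher token count are produced is project-specific and elided here:

```python
# Illustrative metric computation from paired evaluation records (placeholder data layout).
from statistics import mean

records = [
    {"baseline": 0.40, "with_teaching": 0.65, "teacher_tokens": 180},
    {"baseline": 0.70, "with_teaching": 0.70, "teacher_tokens": 120},
    # one entry per validation problem
]

improvement_rate = mean(r["with_teaching"] - r["baseline"] for r in records)
non_degradation = mean(r["with_teaching"] >= r["baseline"] for r in records)
avg_tokens = mean(r["teacher_tokens"] for r in records)

print(f"Improvement rate: {improvement_rate:.1%} (target > 15%)")
print(f"Non-degradation:  {non_degradation:.1%} (target > 95%)")
print(f"Avg teacher tokens: {avg_tokens:.0f} (target < 250)")
```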
Configuration Deep Dive
GRPO Hyperparameters
Critical parameters for successful training are the learning rate, the KL penalty coefficient, the group size (generations per prompt), and the generation length limits.
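A hedged sketch using TRL's GRPOConfig; the values are reasonable starting points rather than the project's published settings:

```python
# Illustrative GRPO hyperparameters; tune per model size and dataset.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="checkpoints/grpo",
    learning_rate=1e-6,            # RL learning rates are far smaller than SFT's
    beta=0.04,                     # KL penalty: higher keeps the policy near the SFT reference
    num_generations=8,             # group size; advantages are normalized within each group
    max_prompt_length=1024,
    max_completion_length=512,     # bounds generation cost and token usage
    temperature=0.7,               # sampling temperature for rollouts
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    bf16=True,
)
```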
Reward Function Configuration
Customize rewards for your use case.
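A hedged sketch of custom reward functions in the shape GRPOTrainer expects; the tag check and length penalty are placeholders, not the rewards ATLAS actually optimizes:

```python
# Illustrative reward functions; GRPOTrainer calls each with the batch of completions.
def format_reward(completions, **kwargs):
    # Placeholder structure check on the generated teaching.
    return [1.0 if "<teaching>" in c and "</teaching>" in c else 0.0 for c in completions]

def efficiency_reward(completions, **kwargs):
    # Mildly penalize verbose completions to encourage token efficiency.
    return [-0.001 * len(c) for c in completions]

# Pass multiple functions to GRPOTrainer(reward_funcs=...); they are combined,
# optionally weighted via GRPOConfig's reward_weights.
reward_funcs = [format_reward, efficiency_reward]
```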
Advanced Training Techniques
Curriculum Learning
Implement progressive difficulty by starting on easier prompts and gradually mixing in harder ones.
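A hedged sketch that assumes each training example carries a difficulty score column; the column name and staging fractions are placeholders:

```python
# Illustrative curriculum: train on progressively larger, harder slices of the data.
from datasets import load_dataset

dataset = load_dataset("your-org/your-rl-prompts", split="train")  # placeholder
dataset = dataset.sort("difficulty")  # placeholder column, easy examples first

n = len(dataset)
stages = [dataset.select(range(int(n * frac))) for frac in (0.33, 0.66, 1.0)]

for i, stage_data in enumerate(stages):
    print(f"Stage {i}: {len(stage_data)} examples")
    # Swap stage_data into the trainer and resume from the previous checkpoint.
```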
Mixed Precision Training
Enable bf16 mixed precision for faster training and lower memory use.
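A hedged sketch; bf16 is generally preferred over fp16 on A100/H100 because it needs no loss scaling:

```python
# Illustrative mixed-precision setting in the training config.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="checkpoints/grpo",
    bf16=True,    # bfloat16 autocast for forward/backward passes
    # fp16=True,  # only if the hardware lacks bf16 support
)
```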
Gradient Accumulation Strategy
Optimize the effective batch size for your GPU memory by keeping the per-device batch small and raising gradient accumulation to compensate.
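The effective batch size is the per-device batch times the accumulation steps times the number of training GPUs; a hedged sketch of the arithmetic with example values:

```python
# Illustrative batch-size arithmetic; pick the largest per-device batch that fits in VRAM
# and recover the desired effective batch with accumulation steps.
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_training_gpus = 4

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_training_gpus
print(f"Effective batch size: {effective_batch}")  # 64 per optimizer step
```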
Troubleshooting
CUDA Out of Memory
Problem: OOM errors during training.
Solutions: reduce per_device_train_batch_size and raise gradient_accumulation_steps, enable gradient checkpointing, shorten max_completion_length, or lower the vLLM server's GPU memory utilization.
Reward Collapse
Problem: Rewards go to zero or negative.
Solutions: confirm the SFT warmup converged before starting GRPO, lower the learning rate, increase the KL penalty (beta), and check the reward function for degenerate scoring that every completion triggers.
vLLM Server Issues
Problem: Connection refused or timeout when the trainer contacts the vLLM server.
Solutions: wait until the server has finished loading the model before launching training, verify the host and port match the trainer configuration, and check that the inference GPUs are not shared with the training processes.
Slow Training
Problem: Training is slower than expected.
Solutions: install Flash Attention 2, enable bf16, confirm generation is actually running on the vLLM GPUs, and increase the per-device batch size if memory allows.