Total time: 24-48 hours (mostly unattended training) • Active setup time: 30-45 minutes • Difficulty: Intermediate
Overview
This guide walks through a complete ATLAS training experiment, demonstrating the two-phase pipeline: supervised fine-tuning (SFT) followed by reinforcement learning with Group Relative Policy Optimization (GRPO).
Prerequisites
Before starting, ensure you have the following (a quick pre-flight check is sketched after the list):
- Hardware: minimum 2× GPUs (one for vLLM, one for training) with 40GB+ VRAM each; 4-8× H100 GPUs recommended
- Environment: Python 3.11/3.12 with ATLAS dependencies installed (see Installation)
- Authentication: HuggingFace token with access to Arc-Intelligence datasets
- Storage: ~200GB for checkpoints and logs
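A quick pre-flight check (optional, assuming only standard PyTorch and huggingface_hub installs) can confirm the GPU count and Hugging Face login before you start:

```python
import torch
from huggingface_hub import whoami

# Optional pre-flight check for the prerequisites above. Dataset access gating
# is not checked here — confirm access to the Arc-Intelligence datasets on the Hub.
gpus = torch.cuda.device_count()
assert gpus >= 2, f"Need at least 2 GPUs (1 for vLLM, 1 for training); found {gpus}"
print(f"Detected {gpus} GPU(s): {[torch.cuda.get_device_name(i) for i in range(gpus)]}")

# Verifies that a Hugging Face token is configured (raises if not logged in)
print("Authenticated to Hugging Face as:", whoami()["name"])
```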
Phase 1: SFT Warmup
The SFT phase establishes foundational reasoning capabilities before adaptive teaching training.
Configuration
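ATLAS ships its own training configs; the sketch below shows representative TRL-style SFT settings as a starting point. The paths and hyperparameter values are assumptions, not the repository's canonical configuration:

```python
from trl import SFTConfig

# Hedged sketch of Phase 1 settings; reconcile paths and hyperparameters with
# the repository's own config files before running.
sft_args = SFTConfig(
    output_dir="checkpoints/atlas-sft",   # assumption: any writable path
    per_device_train_batch_size=2,        # matches the memory figures in the table below
    gradient_accumulation_steps=8,
    learning_rate=2e-5,                   # assumption: typical SFT value
    num_train_epochs=1,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=10,
    save_steps=500,
)
```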
Execution
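A minimal end-to-end run, assuming a TRL-style `SFTTrainer`, a causal-LM teacher checkpoint, and an SFT dataset with a text or chat-messages field. The model and dataset names are placeholders; multi-GPU runs are normally launched through `accelerate launch` rather than plain `python`:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholders: substitute the actual ATLAS base model and SFT dataset.
BASE_MODEL = "your-base-teacher-model"
SFT_DATASET = "your-sft-dataset"

train_ds = load_dataset(SFT_DATASET, split="train")

trainer = SFTTrainer(
    model=BASE_MODEL,   # TRL loads the causal LM by name
    args=SFTConfig(
        output_dir="checkpoints/atlas-sft",
        per_device_train_batch_size=2,
        bf16=True,
    ),
    train_dataset=train_ds,
)
trainer.train()
trainer.save_model("checkpoints/atlas-sft/final")
```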
Expected Metrics
Metric | Expected Range | Notes |
---|---|---|
Training Loss | 1.2-1.5 | Should decrease monotonically |
Gradient Norm | <5.0 | Indicates stable training |
GPU Memory | 70-80GB | Per device with batch size 2 |
Duration | 4-6 hours | On 8×H100 setup |
Phase 2: GRPO Training
The RL phase trains adaptive teaching capabilities through policy gradient optimization.
Technical Background
GRPO optimizes a KL-regularized objective. In simplified form:

J(π) = E[ r(y|x) ] - β · D_KL(π || π_ref)

where:
- r(y|x): reward function based on student performance improvement
- π: current policy
- π_ref: reference policy (the SFT checkpoint)
- β: KL divergence coefficient (default: 0.04)
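To make the "group relative" part concrete, here is an illustrative PyTorch sketch of group-normalized advantages plus the β-weighted KL penalty. Function names, shapes, and values are assumptions for illustration, not the ATLAS implementation:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, num_generations), one reward per sampled response.
    Each advantage is measured relative to the other samples for the same prompt."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def kl_regularized_objective(rewards: torch.Tensor, kl_per_sample: torch.Tensor,
                             beta: float = 0.04) -> torch.Tensor:
    """Expected reward minus the beta-weighted KL to the reference policy."""
    return (rewards - beta * kl_per_sample).mean()

# Toy example: 2 prompts x 4 sampled responses each
rewards = torch.tensor([[0.1, 0.4, 0.4, 0.9],
                        [0.0, 0.2, 0.8, 0.6]])
kl = torch.full_like(rewards, 0.8)   # pretend per-sample KL estimates
print(grpo_advantages(rewards))
print(kl_regularized_objective(rewards, kl))
```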
Configuration
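A hedged sketch of a TRL-style `GRPOConfig` using the defaults discussed under Key Parameters Explained below. Treat every value here as a starting point to be reconciled with the repository's own configs:

```python
from trl import GRPOConfig

grpo_args = GRPOConfig(
    output_dir="checkpoints/atlas-grpo",  # assumption: any writable path
    per_device_train_batch_size=4,        # TRL expects (training GPUs x this) divisible by num_generations
    gradient_accumulation_steps=4,
    learning_rate=1e-6,                   # assumption: typical RL fine-tuning value
    num_generations=32,                   # response samples per prompt (see below)
    temperature=0.7,                      # sampling temperature for generation
    beta=0.04,                            # KL coefficient vs. the SFT reference policy
    max_completion_length=2048,
    use_vllm=True,                        # generate through the external vLLM server
    bf16=True,
    logging_steps=10,
    save_steps=100,
)
```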
Launch with vLLM Server
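One common launch pattern for TRL-style GRPO is to pin one GPU to a dedicated vLLM generation server and train on the remaining devices. The exact command and flags depend on your TRL/vLLM versions and on ATLAS's own launch scripts, so verify against `trl vllm-serve --help` before relying on this sketch:

```python
import os
import subprocess

# Assumption: the SFT checkpoint from Phase 1 is served for generation.
MODEL = "checkpoints/atlas-sft/final"

# Reserve GPU 0 for the vLLM server; the trainer then runs on the remaining GPUs
# (e.g. launched separately with CUDA_VISIBLE_DEVICES=1,...,N).
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")
server = subprocess.Popen(["trl", "vllm-serve", "--model", MODEL], env=env)
print("vLLM server PID:", server.pid)
```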
Key Parameters Explained
num_generations
Number of response samples per prompt. Higher values improve gradient estimates but increase compute cost.
- Default: 32
- Range: 16-64
- Trade-off: Quality vs. speed
temperature
Sampling temperature for generation. Controls exploration vs. exploitation.
- Default: 0.7
- Range: 0.5-1.0
- Effect: Higher values increase diversity
beta
KL divergence coefficient. Prevents policy collapse.
- Default: 0.04
- Range: 0.01-0.1
- Warning: Too low causes instability
degradation_penalty_multiplier
Penalty for responses worse than baseline.
- Default: 2.0
- Purpose: Enforces the non-degradation guarantee
- Formula: `penalty = -multiplier * performance_drop` (sketched below)
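An illustrative reward-shaping function for the formula above; the function name and score inputs are assumptions for this sketch, not ATLAS's reward code:

```python
def shaped_reward(score_with_teaching: float,
                  score_baseline: float,
                  multiplier: float = 2.0) -> float:
    """Reward improvement over the student's baseline; penalize regressions
    `multiplier` times harder than an equivalent gain is rewarded."""
    delta = score_with_teaching - score_baseline
    if delta >= 0:
        return delta
    performance_drop = -delta
    return -multiplier * performance_drop   # penalty = -multiplier * performance_drop

print(shaped_reward(0.8, 0.6))   #  0.2  (improvement passes through)
print(shaped_reward(0.5, 0.6))   # -0.2  (a 0.1 drop costs 2x)
```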
Monitoring Training Progress
Real-time Metrics
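If you are not watching a W&B or TensorBoard dashboard, Hugging Face-style trainers also write a `trainer_state.json` into each checkpoint directory. The checkpoint path and metric keys below are assumptions; match them to whatever your run actually logs:

```python
import json
from pathlib import Path

# Lightweight log inspection without a dashboard.
state_file = Path("checkpoints/atlas-grpo/checkpoint-100/trainer_state.json")
log_history = json.loads(state_file.read_text())["log_history"]

for entry in log_history[-5:]:       # last few logging steps
    step = entry.get("step")
    reward = entry.get("reward")     # key names depend on the trainer
    kl = entry.get("kl")
    print(f"step={step} reward={reward} kl={kl}")
```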
Critical Metrics to Track
Metric | Healthy Range | Warning Signs |
---|---|---|
Reward Mean | Increasing | Plateau or decrease |
Non-degradation Rate | >95% | <90% indicates issues |
KL Divergence | 0.5-2.0 | >5.0 suggests collapse |
GPU Utilization | >80% | <50% indicates bottleneck |
vLLM Throughput | >1000 tok/s | <500 tok/s needs optimization |
Diagnostic Commands
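Two quick checks when utilization or throughput looks off, assuming a local vLLM server on port 8000 (adjust the URL to your launch command):

```python
import subprocess
import urllib.request

# GPU utilization and memory, one line per device
subprocess.run([
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv",
])

# Is the vLLM server answering? vLLM's OpenAI-compatible server exposes /health;
# a TRL `vllm-serve` process may use a different route — check its docs.
try:
    with urllib.request.urlopen("http://localhost:8000/health", timeout=5) as resp:
        print("vLLM server status:", resp.status)
except OSError as err:
    print("vLLM server unreachable:", err)
```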
Expected Outcomes
After successful completion:
Training Duration
- 2× GPUs: 4-5 days
- 4× GPUs: 2-3 days
- 8× H100: 24-36 hours
Performance Metrics (from ATLAS Technical Report)
- Teaching Efficiency: 15.7% average accuracy improvement
- Non-degradation Rate: 97%
- Token Efficiency: 50% reduction in response length (4k → 2k tokens)
- Completion Rate: 31-percentage-point improvement (69% → 100%)
Output Artifacts
Validation
Verify model performance using the repository's evaluation script; a quick smoke-test sketch follows below.
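If you want a quick sanity check before running the full evaluation, a minimal generation test (assuming the GRPO output is a standard transformers checkpoint) looks like this; the checkpoint path and prompt are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT = "checkpoints/atlas-grpo"   # assumption: your training output_dir

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForCausalLM.from_pretrained(
    CHECKPOINT, torch_dtype=torch.bfloat16, device_map="auto"
)

# Confirms the model loads and produces a coherent teaching-style response;
# use the evaluation script for real metrics.
prompt = "A student answered 17 for 8 + 9 * 2. Give a short hint, not the answer."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```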
Troubleshooting
OOM Errors During Training
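General mitigations that apply to Hugging Face-style trainers (not ATLAS-specific settings): lower the per-device batch size, compensate with gradient accumulation, and trade compute for memory with gradient checkpointing, for example:

```python
from trl import GRPOConfig

grpo_args = GRPOConfig(
    output_dir="checkpoints/atlas-grpo",
    per_device_train_batch_size=1,      # was larger
    gradient_accumulation_steps=8,      # keep the effective batch size
    gradient_checkpointing=True,        # recompute activations to save VRAM
    max_completion_length=1024,         # shorter generations also reduce memory
    bf16=True,
)
```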
vLLM Server Connection Failed
Reward Collapse
- Increase `beta` to strengthen the KL constraint
- Reduce `temperature` for more conservative sampling
- Check dataset quality and the reward function implementation
Next Steps
- Online Optimization: Fine-tune your trained model for specific tasks
- Deploy to Production: Integrate ATLAS into your inference pipeline
- Custom Datasets: Train on your domain-specific data
- Architecture Deep Dive: Understand the technical implementation
References
- ATLAS Technical Report - Detailed methodology and ablations
- GRPO Paper - Original GRPO algorithm
- vLLM Documentation - Server configuration options