Overview
This guide provides exact steps to reproduce the closed-loop +15.7% accuracy improvement and related metrics reported in our technical documentation. Once you reproduce the baseline, export the traces and run our offline GRPO pipeline to train a bespoke teacher checkpoint for your domain.

Reproduction requires 4×H100 GPUs for full-scale training. For smaller-scale validation, see the Quick Validation section.
Environment Setup
Hardware Requirements
Full Reproduction
- 4×H100 80GB GPUs
- NVLink interconnect
- 128GB system RAM
- 500GB NVMe storage
Quick Validation
- 1×A100 40GB GPU
- 32GB system RAM
- 100GB storage
- ~4 hours runtime
Software Stack
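The pinned dependency list is not reproduced here. As a hedged sketch (package names and versions are assumptions, not the repository's actual pins), an environment covering the training and serving stack used in this guide might look like:

```bash
# Hypothetical environment setup; versions are illustrative, not the repo's actual pins
conda create -n repro python=3.11 -y
conda activate repro

# Training and serving stack referenced in this guide (assumed package set)
pip install "torch>=2.3" transformers trl peft accelerate vllm datasets

# Optional: experiment tracking for the Monitoring Training section
pip install tensorboard
```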
Configuration Files
Key configuration files for reproduction:
Full Reproduction Steps
Phase 1: SFT Warmup
Train the initial supervised fine-tuned model:
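The repository's exact launch command is not shown here; the following is a hedged sketch, assuming a hypothetical `scripts/train_sft.py` entry point and config path, launched across 4 GPUs with `accelerate`:

```bash
# Hypothetical SFT warmup launch; script name, config path, and output dir are illustrative
accelerate launch --num_processes 4 scripts/train_sft.py \
  --config configs/sft_warmup.yaml \
  --output_dir checkpoints/sft_warmup
```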
Expected duration: 4-8 hours on 4×H100
Checkpoint size: ~16GB
Key metric: Loss < 0.5
Phase 2: GRPO Training
Run reinforcement learning with the vLLM server:
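The server and trainer launch commands are not reproduced here. As a hedged sketch (script names, ports, and flags are hypothetical), the flow is to serve the Phase 1 checkpoint with vLLM for rollout generation, then point the GRPO trainer at that server:

```bash
# 1) Hypothetical: serve the SFT checkpoint with vLLM for rollout generation
python -m vllm.entrypoints.openai.api_server \
  --model checkpoints/sft_warmup \
  --port 8000 &

# 2) Hypothetical: launch GRPO training against the running server
accelerate launch --num_processes 4 scripts/train_grpo.py \
  --config configs/grpo.yaml \
  --vllm-server-url http://localhost:8000
```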
Expected duration: 24-48 hours on 4×H100
Key metrics:
- Reward > 0.5
- KL divergence < 10
- Non-degradation rate > 95%
Phase 3: Evaluation
Validate final performance with the lightweight Transformers snippet below (no additional repo files required):
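The original snippet is not reproduced here; this is a minimal sketch, assuming the Phase 2 output directory and a JSONL eval set with prompt and answer fields (paths and the exact-match scoring rule are placeholders):

```python
# Hypothetical evaluation sketch; checkpoint path, eval file, and scoring rule are
# illustrative placeholders, substitute your own.
import json

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "checkpoints/grpo_final"  # assumed Phase 2 output directory
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

correct = total = 0
with open("data/eval.jsonl") as f:  # assumed eval set with "prompt"/"answer" fields
    for line in f:
        example = json.loads(line)
        inputs = tokenizer(example["prompt"], return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
        completion = tokenizer.decode(
            output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        correct += int(example["answer"] in completion)  # naive exact-match scoring
        total += 1

print(f"Accuracy: {correct / total:.3f} over {total} examples")
```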
Expected results (closed-loop runtime + GRPO):
- Accuracy improvement: +15.7% ± 1.2%
- Completion rate: +31% ± 2%
- Non-degradation: ≥97%
- Token savings: ~50%
To continue beyond the baseline, export the traces with the SDK and launch `python scripts/run_offline_pipeline.py --export-path traces/runtime.jsonl` to begin GRPO training.
Quick Validation
For rapid testing without full training:
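The quick-validation command is not reproduced here; a hedged sketch, reusing the hypothetical GRPO entry point from Phase 2 with reduced-scale overrides that fit a single A100 40GB:

```bash
# Hypothetical reduced-scale run; script name and flags are illustrative
python scripts/train_grpo.py \
  --config configs/grpo.yaml \
  --max-steps 200 \
  --per-device-batch-size 1 \
  --num-generations 2
```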
Expected Metrics
After successful reproduction, you should observe:
| Metric | Expected Value | Tolerance |
|---|---|---|
| Average accuracy gain (closed loop) | +15.7% | ±1.2% |
| Max improvement (closed loop) | +29.6% | ±2.1% |
| Completion rate | ~100% | ±2% |
| Token reduction | 50% | ±5% |
| Generation speedup | 13.6% | ±2% |
| Non-degradation rate | 97% | ±1% |
| Offline GRPO gain | Sustained lift from training on exported traces | Compute-bound |
Monitoring Training
Real-time Metrics
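The monitoring command is not shown here. Assuming the trainer writes TensorBoard event files (an assumption; adjust to your logging backend), a minimal sketch is:

```bash
# Hypothetical: point TensorBoard at the trainer's log directory
tensorboard --logdir checkpoints/grpo_final/runs --port 6006
```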
Key Indicators
Healthy Training
- GPU utilization > 90%
- Reward trending upward
- KL divergence stable (5-15)
- Loss decreasing smoothly
- No NaN/Inf values
Issues to Watch
Troubleshooting
CUDA Out of Memory
vLLM Server Connection Failed
Slow Training Speed
Authentication Issues
Validation Snippets
Statistical Significance Test
Drop this snippet into any Python session (or save it as `tools/validate_significance.py`) to compare baseline vs. enhanced runs:
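The snippet body is not reproduced here; this is a minimal sketch, assuming matched per-task accuracies from the two runs (the arrays below are placeholders, not measured results) and a paired t-test via SciPy:

```python
# Hypothetical significance check; the accuracy arrays are placeholders,
# load your own baseline and enhanced per-task results instead.
import numpy as np
from scipy import stats

# Per-task accuracies from matched baseline and enhanced runs (placeholder values)
baseline = np.array([0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.73, 0.51])
enhanced = np.array([0.78, 0.70, 0.83, 0.65, 0.80, 0.72, 0.88, 0.67])

# Paired t-test on per-task gains
t_stat, p_value = stats.ttest_rel(enhanced, baseline)
mean_gain = float(np.mean(enhanced - baseline))

print(f"Mean accuracy gain: {mean_gain:+.3f}")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print("Significant at p < 0.05" if p_value < 0.05 else "Not significant at p < 0.05")
```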