Common Issues
This guide covers frequent problems and their solutions when working with ATLAS.
Installation Issues
CUDA Not Available
Problem: torch.cuda.is_available() returns False
Solutions:
1. Verify CUDA Installation
2. Reinstall PyTorch with CUDA
3. Check GPU Compatibility
Ensure GPU compute capability ≥7.0:
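The compatibility check can be sketched in Python as follows. This is a minimal illustration, not an ATLAS utility: the helper names meets_minimum and detect_capability are hypothetical, and the 7.0 threshold is taken from this guide. PyTorch is imported lazily so the comparison helper also works on machines where PyTorch is not installed.

```python
from typing import Optional, Tuple

MIN_CAPABILITY = (7, 0)  # threshold stated in this guide


def meets_minimum(capability: Tuple[int, int],
                  minimum: Tuple[int, int] = MIN_CAPABILITY) -> bool:
    """Compare (major, minor) capability tuples lexicographically."""
    return tuple(capability) >= tuple(minimum)


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the (major, minor) capability of GPU 0, or None if unavailable."""
    try:
        import torch  # lazy import: the helper above needs no PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


cap = detect_capability()
if cap is None:
    print("No CUDA device visible to PyTorch")
else:
    status = "OK" if meets_minimum(cap) else "below 7.0"
    print(f"Compute capability {cap[0]}.{cap[1]}: {status}")
```

If the script reports that no device is visible, the first two steps (CUDA installation and the PyTorch build) are the more likely cause than the GPU model itself.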
Flash Attention Build Failure
Problem: pip install flash-attn fails with compilation errors
Solutions:
- Ensure the installed CUDA toolkit version matches the CUDA version your PyTorch build targets
- Install with pre-built wheels:
- Skip Flash Attention (with performance impact):
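One way to skip Flash Attention without editing code on every machine is to choose the attention backend at runtime. The sketch below assumes models are loaded through Hugging Face Transformers, whose from_pretrained accepts an attn_implementation argument; ATLAS may expose its own switch for this. The helper name pick_attn_implementation and the model name in the comment are placeholders.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is installed, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # PyTorch's built-in scaled-dot-product attention needs no extra build step.
    return "sdpa"


attn_impl = pick_attn_implementation()
print(f"attn_implementation={attn_impl}")

# Hypothetical usage (assumes Transformers is installed; model name is a placeholder):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "your-org/your-model", attn_implementation=attn_impl
# )
```

The check only confirms that the package can be found, not that its compiled extension loads, so a partially built install may still need to be removed.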
Hugging Face Access Issues
Problem: Can’t download models from Hugging Face
Solutions:
Memory Issues
CUDA Out of Memory
Problem: RuntimeError: CUDA out of memory
Progressive Solutions:
1. Reduce Batch Size
2. Enable Gradient Checkpointing
3. Use Mixed Precision
4. Quantization
5. CPU Offloading
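Reducing the batch size changes the effective batch size unless gradient accumulation is raised to compensate. The sketch below shows that arithmetic only; the function name halve_batch and the starting values are illustrative, and the corresponding settings in your trainer may be named differently (for example per_device_train_batch_size and gradient_accumulation_steps in Hugging Face Transformers).

```python
def halve_batch(per_device_batch: int, grad_accum_steps: int):
    """Halve the per-device batch and double accumulation.

    The product (the effective batch size per device) stays unchanged.
    """
    if per_device_batch < 2 or per_device_batch % 2:
        raise ValueError("expected an even batch size of at least 2")
    return per_device_batch // 2, grad_accum_steps * 2


batch, accum = 8, 2  # illustrative starting values
print(f"batch={batch} accum={accum} effective={batch * accum}")
while batch > 1:
    batch, accum = halve_batch(batch, accum)
    print(f"batch={batch} accum={accum} effective={batch * accum}")
```

Each printed row keeps the effective batch at 16 while the per-step activation memory shrinks. Once the batch size reaches 1, the remaining steps in the list are the next things to try.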
Memory Calculation
Estimate memory requirements:
Training Issues
Loss Not Decreasing
Problem: Training loss plateaus or increases
Diagnostic Steps:
- Reduce learning rate: learning_rate=1e-6
- Increase warmup: warmup_ratio=0.2
- Check for data issues: duplicates, incorrect labels
- Verify loss function: ensure proper masking
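The last two checks can be sketched with the standard library alone. This is an illustration under assumptions, not ATLAS code: find_duplicates treats prompts that differ only in case or whitespace as duplicates, and masked_mean shows why padded or prompt positions must be excluded from the loss average. Both function names and the sample data are hypothetical.

```python
from collections import Counter


def find_duplicates(texts):
    """Return {normalized_text: count} for entries that appear more than once."""
    counts = Counter(" ".join(t.lower().split()) for t in texts)
    return {text: n for text, n in counts.items() if n > 1}


def masked_mean(losses, mask):
    """Average per-token losses over positions where mask is 1."""
    kept = sum(mask)
    if kept == 0:
        raise ValueError("mask excludes every position")
    return sum(loss * m for loss, m in zip(losses, mask)) / kept


samples = [
    "What is 2 + 2?",
    "what is  2 + 2?",  # same prompt, different case and spacing
    "Name the capital of France.",
]
print(find_duplicates(samples))

# The third position stands in for a padding token with a meaningless loss.
token_losses = [1.0, 3.0, 100.0]
print(masked_mean(token_losses, [1, 1, 0]))  # padding excluded
print(masked_mean(token_losses, [1, 1, 1]))  # padding wrongly included
```

A large gap between the masked and unmasked averages on real batches suggests the training loss is dominated by positions that should not contribute to it.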
NaN or Inf Loss
Problem: Loss becomes NaN or Inf
Solutions:
Slow Training Speed
Problem: Training is slower than expected
Performance Optimizations:
Inference Issues
Slow Inference
Problem: Generation is too slow for production
Solutions:
Inconsistent Results
Problem: Different results on each run
Solutions:
vLLM Server Issues
Server Won’t Start
Problem: vLLM server fails to launch
Diagnostic Commands:
- Reduce GPU memory utilization: --gpu-memory-utilization 0.8
- Use a smaller context length: --max-model-len 1024
- Enable prefix caching: --enable-prefix-caching
- Try a different port
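The adjustments above can be combined into a single launch command. The sketch below only builds and prints the command; it does not start a server. It assumes the vllm serve entry point of recent vLLM releases (older installs use python -m vllm.entrypoints.openai.api_server --model instead), and the model name, the default port 8000, and both helper names are placeholders.

```python
import shlex
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if this process can bind host:port right now."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind((host, port))
        return True
    except OSError:
        return False


def build_vllm_command(model: str, port: int = 8000,
                       gpu_memory_utilization: float = 0.8,
                       max_model_len: int = 1024,
                       enable_prefix_caching: bool = True) -> list:
    """Assemble a vLLM launch command using the flags from this guide."""
    cmd = [
        "vllm", "serve", model,
        "--port", str(port),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
    ]
    if enable_prefix_caching:
        cmd.append("--enable-prefix-caching")
    return cmd


# Fall back to the next port if the default one is already taken.
port = 8000 if port_is_free(8000) else 8001
print(shlex.join(build_vllm_command("your-org/your-model", port=port)))
```

Building the command as a list keeps it safe to pass to subprocess.run without shell quoting issues.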
Connection Refused
Problem: Can’t connect to vLLM server
Solutions:
Online Optimization Issues
GEPA Not Converging
Problem: Online optimization shows no improvement
Solutions:
High API Costs
Problem: Online optimization exceeding budget
Solutions:
Getting Help
If these solutions don’t resolve your issue:
- Check existing issues: GitHub Issues
- Join community: Discord Server
- File bug report with:
  - Error message and stack trace
  - System info: python -m torch.utils.collect_env
  - Minimal reproduction code
  - Configuration used
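The system-info part of a bug report can be gathered with a short script. This is a sketch with hypothetical helper names: basic_system_info uses only the standard library, and torch_env_report wraps the python -m torch.utils.collect_env command named above, returning None when PyTorch is missing or the command fails.

```python
import platform
import subprocess
import sys


def basic_system_info() -> dict:
    """Collect interpreter and OS details without any third-party packages."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }


def torch_env_report(timeout: float = 120.0):
    """Run `python -m torch.utils.collect_env`; return its output or None."""
    try:
        result = subprocess.run(
            [sys.executable, "-m", "torch.utils.collect_env"],
            capture_output=True, text=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


for key, value in basic_system_info().items():
    print(f"{key}: {value}")
```

Calling torch_env_report() and pasting its output below the basic details covers the system-info item. The error message, reproduction code, and configuration still have to be added by hand.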