Common Issues
This guide covers frequent problems and their solutions when working with ATLAS.
Installation Issues
CUDA Not Available
Problem: torch.cuda.is_available() returns False
Solutions:
1. Verify CUDA Installation
2. Reinstall PyTorch with CUDA
3. Check GPU Compatibility
Ensure the GPU compute capability is ≥7.0.
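The compatibility check can be sketched in Python with PyTorch's own device queries. This is a minimal sketch, assuming PyTorch is importable; the helper names `meets_minimum` and `report_gpus` are illustrative and not part of ATLAS.

```python
# Sketch: report each visible GPU's compute capability via PyTorch.
try:
    import torch
except ImportError:  # PyTorch is not installed, so there is nothing to inspect
    torch = None

MIN_CAPABILITY = (7, 0)  # threshold stated in this guide


def meets_minimum(capability, minimum=MIN_CAPABILITY):
    """Compare (major, minor) tuples, e.g. (8, 6) >= (7, 0)."""
    return tuple(capability) >= tuple(minimum)


def report_gpus():
    """Return (name, capability, ok) for every visible CUDA device."""
    if torch is None or not torch.cuda.is_available():
        return []
    rows = []
    for index in range(torch.cuda.device_count()):
        capability = torch.cuda.get_device_capability(index)
        name = torch.cuda.get_device_name(index)
        rows.append((name, capability, meets_minimum(capability)))
    return rows


for name, capability, ok in report_gpus():
    status = "OK" if ok else "below the 7.0 minimum"
    print(f"{name}: compute capability {capability[0]}.{capability[1]} ({status})")
```

An empty result means PyTorch is missing or cannot see a CUDA device, which points back to the first two steps.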
Flash Attention Build Failure
Problem: pip install flash-attn fails with compilation errors
Solutions:
- Ensure the CUDA toolkit version matches the CUDA version PyTorch was built against
- Install from a pre-built wheel instead of compiling from source
- Skip Flash Attention (with a performance impact)
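Skipping Flash Attention can be handled at load time by falling back to PyTorch's built-in attention when the package is absent. The sketch below assumes models are loaded through Hugging Face transformers, whose `from_pretrained` accepts an `attn_implementation` argument; the helper name and the model id are placeholders.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" when flash-attn is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"  # PyTorch's built-in scaled dot-product attention


attn_implementation = pick_attn_implementation()
print(f"Using attention implementation: {attn_implementation}")

# Assumed usage with transformers (the model id is a placeholder):
# model = AutoModelForCausalLM.from_pretrained(
#     "<model-id>", attn_implementation=attn_implementation
# )
```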
Hugging Face Access Issues
Problem: Can’t download models from Hugging Face
Solutions:
Memory Issues
CUDA Out of Memory
Problem: RuntimeError: CUDA out of memory
Progressive Solutions:
Reduce Batch Size
Enable Gradient Checkpointing
Use Mixed Precision
Quantization
CPU Offloading
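The first step, reducing the batch size, can be automated by halving it until a step fits in memory. This is a sketch under stated assumptions: `run_step` is a hypothetical callable that runs one training step at a given batch size, and the out-of-memory condition is detected from the error message shown above.

```python
def find_working_batch_size(run_step, start_batch_size, min_batch_size=1):
    """Halve the batch size until run_step(batch_size) stops raising an OOM error."""
    batch_size = start_batch_size
    while batch_size >= min_batch_size:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as error:
            if "out of memory" not in str(error).lower():
                raise  # unrelated failure: do not hide it
            # With PyTorch, calling torch.cuda.empty_cache() here is a common extra step.
            batch_size //= 2
    raise RuntimeError(f"Out of memory even at batch size {min_batch_size}")


def fake_step(batch_size):
    """Stand-in for a real training step: pretends anything above 4 does not fit."""
    if batch_size > 4:
        raise RuntimeError("CUDA out of memory")


print(find_working_batch_size(fake_step, start_batch_size=32))
```

If even the smallest batch size fails, move on to the later options in the list.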
Memory Calculation
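A back-of-the-envelope estimate can be sketched as follows. The figures are common rules of thumb rather than ATLAS-specific numbers: weights cost the parameter count times the bytes per parameter, and mixed-precision training with Adam is often approximated as 16 bytes per parameter. Activations and framework overhead are not included, so treat the results as lower bounds.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}
GIB = 1024 ** 3


def weights_gib(num_params, dtype="bf16"):
    """Memory for the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


def adam_training_gib(num_params):
    """Rule of thumb for mixed-precision Adam training, excluding activations.

    16 bytes per parameter = 2 (half-precision weights) + 2 (gradients)
    + 12 (fp32 master weights, momentum, and variance).
    """
    return num_params * 16 / GIB


params = 8e9  # example size, not a specific ATLAS model
print(f"bf16 weights:  {weights_gib(params, 'bf16'):.1f} GiB")
print(f"int4 weights:  {weights_gib(params, 'int4'):.1f} GiB")
print(f"Adam training: {adam_training_gib(params):.1f} GiB plus activations")
```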
Estimate memory requirements before launching a run.
Training Issues
Loss Not Decreasing
Problem: Training loss plateaus or increases
Diagnostic Steps:
- Reduce learning rate: learning_rate=1e-6
- Increase warmup: warmup_ratio=0.2
- Check for data issues: duplicates, incorrect labels
- Verify loss function: ensure proper masking
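To make the warmup setting concrete, the sketch below shows what a warmup ratio of 0.2 means for a linear warmup: the learning rate ramps from zero to its peak over the first 20% of steps. This is an illustration, not the trainer's actual scheduler. It holds the rate constant after warmup for simplicity, whereas real schedulers usually decay it.

```python
import math


def warmup_steps(total_steps, warmup_ratio):
    """Number of steps spent ramping up the learning rate."""
    return math.ceil(total_steps * warmup_ratio)


def lr_at_step(step, total_steps, peak_lr=1e-6, warmup_ratio=0.2):
    """Linear warmup from 0 to peak_lr, then constant (simplified)."""
    steps = warmup_steps(total_steps, warmup_ratio)
    if steps > 0 and step < steps:
        return peak_lr * step / steps
    return peak_lr


total = 1000
for step in (0, 100, 200, 500):
    print(f"step {step:4d}: lr = {lr_at_step(step, total):.2e}")
```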
NaN or Inf Loss
Problem: Loss becomes NaN or Inf
Solutions:
Slow Training Speed
Problem: Training is slower than expected
Performance Optimizations:
Inference Issues
Slow Inference
Problem: Generation is too slow for production
Solutions:
Inconsistent Results
Problem: Different results on each run
Solutions:
vLLM Server Issues
Server Won’t Start
Problem: vLLM server fails to launch
Diagnostic Commands:
- Reduce GPU memory utilization: --gpu-memory-utilization 0.8
- Use a smaller context length: --max-model-len 1024
- Enable prefix caching: --enable-prefix-caching
- Try a different port
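The flags above can be combined into a single launch command. The sketch below only builds and prints the argument list. It assumes a vLLM version that provides the `vllm serve` entry point (older releases use `python -m vllm.entrypoints.openai.api_server --model ...` instead), and the model id and port are placeholders.

```python
import shlex


def build_vllm_command(model, port=8000, gpu_memory_utilization=0.8,
                       max_model_len=1024, prefix_caching=True):
    """Build the argv list for a conservative vLLM launch."""
    command = [
        "vllm", "serve", model,
        "--port", str(port),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
    ]
    if prefix_caching:
        command.append("--enable-prefix-caching")
    return command


# Print the command for inspection; pass the list to subprocess.run() to launch it.
print(shlex.join(build_vllm_command("<model-id>", port=8001)))
```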
Connection Refused
Problem: Can’t connect to vLLM server
Solutions:
SDK Runtime Issues
The following sections cover common issues when using the Atlas SDK for runtime orchestration and continual learning.
SDK Installation Issues
Python Version Mismatch
Problem: ImportError or crashes at import time
Solutions:
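A quick way to confirm which interpreter is running is to compare its version against the SDK's minimum. The minimum of 3.10 in this sketch is an assumption for illustration; check the SDK's packaging metadata for the real requirement.

```python
import sys


def check_python(minimum=(3, 10), current=None):
    """Return (ok, message) for the running interpreter.

    `minimum` is an assumed placeholder, not a confirmed SDK requirement.
    """
    current = tuple(current or sys.version_info[:2])
    ok = current >= tuple(minimum)
    message = (
        f"Python {current[0]}.{current[1]} at {sys.executable} "
        f"{'meets' if ok else 'is below'} the required {minimum[0]}.{minimum[1]}"
    )
    return ok, message


print(check_python()[1])
```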
Package Import Errors
Problem: ModuleNotFoundError after installation
Solutions:
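This error often means the package was installed into a different environment than the one running the code. The sketch below reports whether a top-level module is visible to the current interpreter and which interpreter that is. The module name passed in is up to the caller; `json` is used here only as a stand-in.

```python
import importlib.util
import sys


def diagnose_import(module_name):
    """Report whether a top-level module is importable and from where."""
    spec = importlib.util.find_spec(module_name)
    return {
        "module": module_name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
        "interpreter": sys.executable,
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }


for key, value in diagnose_import("json").items():
    print(f"{key}: {value}")
```

If `found` is false, install the package with the interpreter shown, for example by running pip as `python -m pip` from that same interpreter.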
API Configuration
Missing API Keys
Problem: API key not found or authentication errors
Solutions:
Using Environment Variables
Using .env File
Config File Reference
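Whichever of these options is used, it helps to confirm that the keys are actually visible to the Python process before starting a run. In this sketch the variable names are examples only; use the names your config actually references.

```python
import os


def missing_keys(required, environ=None):
    """Return the names in `required` that are unset or blank."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name, "").strip()]


# Example names only; substitute the variables your config refers to.
required = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]
missing = missing_keys(required)
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All required API keys are set.")
```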
Multi-Provider Configuration
Problem: Using multiple LLM providers in one config
Solution:
Storage & Database
Docker Daemon Not Running
Problem: Cannot connect to Docker daemon
Solutions:
Postgres Connection Failures
Problem: could not connect to server
Diagnostic Steps:
1. Start Postgres with atlas init
2. Verify Connection
3. Check Config
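For the connection check, a plain TCP probe shows whether anything is listening before database credentials come into play. This sketch assumes Postgres is exposed on localhost port 5433, the port named under Port Conflicts in this guide; adjust both values to match your config.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unreachable
        return False


if can_connect("127.0.0.1", 5433):
    print("Something is listening on 127.0.0.1:5433")
else:
    print("Nothing reachable on 127.0.0.1:5433; is the Postgres container running?")
```

The same probe works for a "connection refused" error from any local server, since it only tests that the port accepts connections.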
Port Conflicts
Problem: Port 5433 is already in use
Solutions:
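One way to pick an alternative is to probe for the next unused port. The sketch below tries to bind each candidate on the loopback interface; the helper names are illustrative, and whichever port it reports still has to be written into your own configuration.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if the port can be bound right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:  # typically "address already in use"
            return False


def next_free_port(start=5433, attempts=20):
    """Return the first free port in [start, start + attempts)."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    raise RuntimeError(f"No free port found in {start}-{start + attempts - 1}")


print("5433 free:", port_is_free(5433))
print("First free port from 5433:", next_free_port(5433))
```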
Running Without Storage
Problem: Want to run the SDK without Postgres
Solution: Storage is optional. The SDK will run without persistent storage, but rewards and learning history won’t be saved to a database. Run data is written to .atlas/runs/ as JSON files, without database persistence.
Discovery Issues
atlas env init Finds Nothing
Problem: No agent classes detected
Solutions:
Check Project Structure
Manually Specify Agent
Check Virtual Environment
Wrong Class Detected
Problem: Discovery picks the wrong agent
Solution: Override auto-discovery with explicit config:
Factory Synthesis Failures
Problem: Generated factory code fails
Solutions:
Runtime Errors
LLM Provider Authentication
Problem: 401 Unauthorized or 403 Forbidden
Solutions:
Timeout Errors
Problem: Request timeout during execution
Solutions:
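Besides raising the timeout in configuration, transient timeouts can be absorbed by retrying with exponential backoff. This is a generic sketch, not the SDK's built-in behavior. `operation` is any zero-argument callable, and the built-in `TimeoutError` is assumed as the exception type; provider clients often raise their own timeout exceptions, which can be passed through `retry_on`.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=1.0,
                      retry_on=(TimeoutError,), sleep=time.sleep):
    """Call operation(), retrying on timeout with delays of 1x, 2x, 4x base_delay."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            sleep(base_delay * 2 ** attempt)


# Demonstration with a stand-in operation that times out twice, then succeeds.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("Request timeout")
    return "ok"


print(call_with_retries(flaky_request, base_delay=0.01))
```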
MCP Server Connection Issues
Problem: MCP tools not available or connection refused
Diagnostic Steps:
Continual Learning Support
For issues specific to offline training, reward synthesis, or the learning engine, refer to the training-specific sections above. The Atlas SDK handles runtime orchestration and data collection, while Atlas Core handles offline model training from collected traces.
Getting Help
If these solutions don’t resolve your issue:
- Check existing issues: GitHub Issues
- Join community: Discord Server
- File bug report with:
  - Error message and stack trace
  - System info: python -m torch.utils.collect_env
  - Minimal reproduction code
  - Configuration used