# Common Issues

This guide covers frequent problems and their solutions when working with ATLAS.
## Installation Issues

### CUDA Not Available

**Problem:** `torch.cuda.is_available()` returns `False`

**Solutions:**

1. **Verify CUDA installation**

   ```bash
   nvidia-smi
   nvcc --version
   ```

   If not found, install CUDA 11.8+ from NVIDIA.

2. **Reinstall PyTorch with CUDA**

   ```bash
   pip uninstall torch torchvision torchaudio
   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
   ```

3. **Check GPU compatibility**

   Ensure GPU compute capability is 7.0 or higher:

   ```python
   import torch
   torch.cuda.get_device_capability()
   ```
### Flash Attention Build Failure

**Problem:** `pip install flash-attn` fails with compilation errors

**Solutions:**

1. Ensure the CUDA toolkit version matches the one PyTorch was built with (see the check below).

2. Install with pre-built wheels:

   ```bash
   pip install flash-attn --no-build-isolation
   ```

3. Skip Flash Attention (with a performance impact):

   ```python
   attn_implementation = "eager"  # instead of "flash_attention_2"
   ```
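To confirm the toolkit/PyTorch match before rebuilding, a minimal check using standard PyTorch attributes:

```python
import torch

# CUDA version PyTorch was compiled against; compare with `nvcc --version`
print(torch.version.cuda)         # e.g. "11.8"
print(torch.cuda.is_available())  # sanity check that the GPU is visible
```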
### Hugging Face Access Issues

**Problem:** Can’t download models from Hugging Face

**Solutions:**

```bash
# Login to Hugging Face
huggingface-cli login

# Set cache directory if disk space is limited
export HF_HOME=/path/to/cache

# Use offline mode if models are already downloaded
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
```
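If login succeeds but downloads still fail, resolving the model explicitly can isolate the problem; a minimal sketch using `huggingface_hub` (the model name is the one used elsewhere in this guide):

```python
from huggingface_hub import snapshot_download

# Downloads the model (or resolves it from cache) and prints the local path
path = snapshot_download("Arc-Intelligence/ATLAS-8B-Thinking")
print(path)
```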
## Memory Issues

### CUDA Out of Memory

**Problem:** `RuntimeError: CUDA out of memory`

**Progressive Solutions:**

1. **Reduce batch size**

   ```python
   config.per_device_train_batch_size = 1
   config.gradient_accumulation_steps = 32  # maintain effective batch size (see the check after this list)
   ```

2. **Enable gradient checkpointing**

   ```python
   config.gradient_checkpointing = True  # trades compute for memory
   ```

3. **Use mixed precision**

   ```python
   config.fp16 = True  # or bf16 = True for A100/H100
   ```

4. **Quantize the model**

   ```python
   # 8-bit quantization
   config.load_in_8bit = True

   # 4-bit quantization
   config.load_in_4bit = True
   config.bnb_4bit_compute_dtype = torch.float16
   ```

5. **Offload to CPU**

   ```python
   config.offload = True
   ```

   ```bash
   # Or with DeepSpeed via Accelerate
   accelerate launch --config_file accelerate/deepspeed_zero3_cpu_offloading.yaml -m atlas_core.cli.train ...
   ```
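As a sanity check, the effective batch size is the product of the per-device batch size, the accumulation steps, and the number of GPUs; a minimal sketch:

```python
# Trading per-device batch size for accumulation keeps the effective batch size constant
per_device_train_batch_size = 1
gradient_accumulation_steps = 32
num_gpus = 1  # adjust for your setup

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 32
```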
### Memory Calculation

Estimate memory requirements:

```python
def estimate_memory_gb(model_params_b, batch_size, seq_length):
    """Rough memory estimate for training."""
    # Model weights: params (in billions) x 2 bytes (FP16) ~= GB
    weights_gb = model_params_b * 2
    # Activations (rough estimate, assuming hidden size 8192)
    activations_gb = (batch_size * seq_length * 8192 * 4) / 1e9
    # Gradients and optimizer states (Adam)
    optimizer_gb = weights_gb * 4
    total_gb = weights_gb + activations_gb + optimizer_gb
    return total_gb

# Example: 8B model
memory_needed = estimate_memory_gb(8, batch_size=4, seq_length=2048)
print(f"Estimated memory: {memory_needed:.1f} GB")
```
## Training Issues

### Loss Not Decreasing

**Problem:** Training loss plateaus or increases

**Diagnostic Steps:**

```python
# Check learning rate
print(f"Current LR: {trainer.optimizer.param_groups[0]['lr']}")

# Verify data loading
sample = next(iter(train_dataloader))
print(f"Input shape: {sample['input_ids'].shape}")
print(f"Labels present: {'labels' in sample}")

# Check gradient flow
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad_norm={param.grad.norm().item():.4f}")
```

**Solutions:**

- Reduce learning rate: `learning_rate=1e-6`
- Increase warmup: `warmup_ratio=0.2`
- Check for data issues: duplicates, incorrect labels
- Verify the loss function: ensure proper masking (see the sketch after this list)
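For the masking check, a minimal sketch assuming the Hugging Face convention that prompt and padding positions are labeled `-100` so they are excluded from the loss:

```python
# Assumes batches follow the Hugging Face convention: ignored label positions = -100
sample = next(iter(train_dataloader))
labels = sample["labels"]

masked_fraction = (labels == -100).float().mean().item()
print(f"Masked label tokens: {masked_fraction:.1%}")  # 0% may mean prompts are not masked
```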
### NaN or Inf Loss

**Problem:** Loss becomes `NaN` or `Inf`

**Solutions:**

```python
# Clip gradients more aggressively
config.max_grad_norm = 0.5

# Reduce learning rate
config.learning_rate = 1e-7

# Check for numerical instability
config.fp32 = True  # use full precision temporarily

# Add gradient debugging (see usage below)
def check_gradients(model):
    for name, param in model.named_parameters():
        if param.grad is not None:
            if torch.isnan(param.grad).any():
                print(f"NaN gradient in {name}")
            if torch.isinf(param.grad).any():
                print(f"Inf gradient in {name}")
```
### Slow Training Speed

**Problem:** Training is slower than expected

**Performance Optimizations:**

```python
# Enable torch.compile (PyTorch 2.0+)
model = torch.compile(model)

# Use Flash Attention
config.attn_implementation = "flash_attention_2"

# Optimize data loading
config.dataloader_num_workers = 4
config.dataloader_pin_memory = True

# Enable TF32 on Ampere GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Profile to find bottlenecks
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    trainer.train()
print(prof.key_averages().table(sort_by="cuda_time_total"))
```
## Inference Issues

### Slow Inference

**Problem:** Generation is too slow for production

**Solutions:**

```python
# Use vLLM for faster inference
from vllm import LLM, SamplingParams

llm = LLM(model="Arc-Intelligence/ATLAS-8B-Thinking")
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)

# Batch processing
outputs = llm.generate(prompts, sampling_params)

# Or use torch.compile
model = torch.compile(model, mode="reduce-overhead")

# Enable KV cache
model.config.use_cache = True
```
### Inconsistent Results

**Problem:** Different results on each run

**Solutions:**

```python
# Set seeds for reproducibility
import random

import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)

# Use deterministic algorithms
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False

# Set temperature to 0 for deterministic generation
generation_kwargs = {
    "temperature": 0.0,
    "do_sample": False,
}
```
## vLLM Server Issues

### Server Won’t Start

**Problem:** vLLM server fails to launch

**Diagnostic Commands:**

```bash
# Check if port is in use
lsof -i :8000

# Test with a smaller model
python -m vllm.entrypoints.openai.api_server \
  --model facebook/opt-125m \
  --port 8001

# Check GPU memory
nvidia-smi

# Verify vLLM installation
python -c "import vllm; print(vllm.__version__)"
```

**Solutions** (a Python-API equivalent follows this list):

- Reduce `--gpu-memory-utilization 0.8`
- Use a smaller `--max-model-len 1024`
- Enable `--enable-prefix-caching`
- Try a different port
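The same settings are also exposed through vLLM’s Python API; a minimal sketch (parameter names per vLLM’s `LLM` constructor; verify against your installed version):

```python
from vllm import LLM

# Mirrors the CLI flags above
llm = LLM(
    model="facebook/opt-125m",
    gpu_memory_utilization=0.8,
    max_model_len=1024,
    enable_prefix_caching=True,
)
```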
### Connection Refused

**Problem:** Can’t connect to vLLM server

**Solutions:**

```bash
# Check the server is running
ps aux | grep vllm

# Test connection
curl http://localhost:8000/v1/models

# Check firewall
sudo ufw status
```

```python
# Use the correct URL in code
vllm_url = "http://localhost:8000/v1"  # http, not https
```
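To verify end to end from Python, a minimal sketch using the OpenAI client against vLLM’s OpenAI-compatible endpoint (vLLM does not validate the key, so any placeholder works):

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the key is a placeholder
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
print(client.models.list())
```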
## SDK Runtime Issues

The following sections cover common issues when using the Atlas SDK for runtime orchestration and continual learning.

### SDK Installation Issues

#### Python Version Mismatch

**Problem:** `ImportError` or crashes at import time

**Solutions:**

```bash
# Check Python version
python --version  # should be 3.10+

# Create a virtual environment with the correct version
python3.12 -m venv .venv
source .venv/bin/activate

# Reinstall the SDK
pip install --upgrade arc-atlas
```

**Note:** Python 3.9 and earlier are not supported. Use 3.10+ (3.13 recommended).
#### Package Import Errors

**Problem:** `ModuleNotFoundError` after installation

**Solutions:**

```bash
# Ensure you're in the correct environment
which python
which pip

# Reinstall in the current environment
pip uninstall arc-atlas -y
pip install arc-atlas

# Verify installation
python -c "import atlas; print(atlas.__version__)"
```
### API Configuration

#### Missing API Keys

**Problem:** API key not found or authentication errors

**Solutions:**

Using environment variables:

```bash
# Export directly
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."
export ANTHROPIC_API_KEY="..."

# Verify
echo $OPENAI_API_KEY
```

Or use a `.env` file:

```bash
# Create .env file in project root
cat > .env << EOF
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
ANTHROPIC_API_KEY=...
EOF

# The SDK auto-loads .env files
atlas run --config config.yaml --task "Your task"
```

Then reference the key in config:

```yaml
# config.yaml
agent:
  llm:
    provider: openai
    api_key_env: OPENAI_API_KEY  # must be set in the environment
```
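A quick way to confirm which keys the current process can actually see; a minimal sketch (extend the tuple with any other providers you use):

```python
import os

# Report which provider keys are visible to the current process
for key in ("OPENAI_API_KEY", "GEMINI_API_KEY", "ANTHROPIC_API_KEY"):
    print(f"{key}: {'set' if os.environ.get(key) else 'MISSING'}")
```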
#### Multi-Provider Configuration

**Problem:** Using multiple LLM providers in one config

**Solution:**

```yaml
# Student uses OpenAI
agent:
  llm:
    provider: openai
    model: gpt-4.1-mini
    api_key_env: OPENAI_API_KEY

# Teacher uses OpenAI
teacher:
  llm:
    provider: openai
    model: gpt-4.1
    api_key_env: OPENAI_API_KEY

# Reward system uses Gemini
rim:
  small_model:
    provider: gemini
    model: gemini/gemini-2.5-flash
    api_key_env: GEMINI_API_KEY
  large_model:
    provider: gemini
    model: gemini/gemini-2.5-pro
    api_key_env: GEMINI_API_KEY
```
### Storage & Database

#### Docker Daemon Not Running

**Problem:** Cannot connect to Docker daemon

**Solutions:**

```bash
# macOS
open -a Docker

# Linux - check status
sudo systemctl status docker

# Linux - start Docker
sudo systemctl start docker

# Verify
docker ps
```
#### Postgres Connection Failures

**Problem:** `could not connect to server`

**Diagnostic Steps:**

```bash
# Check if Postgres is running
docker ps | grep postgres

# Check if the port is accessible
lsof -i :5433

# Test connection
psql postgresql://atlas:atlas@localhost:5433/atlas -c "SELECT 1"
```

**Solutions:**

1. Start Postgres with `atlas init`:

   ```bash
   atlas init
   # Starts bundled Docker + Postgres on localhost:5433
   ```

2. Verify the connection:

   ```bash
   # Check the container is running
   docker ps --filter "name=atlas"

   # Test connection
   docker exec -it $(docker ps -q -f name=atlas) \
     psql -U atlas -c "SELECT version()"
   ```

3. Check the config:

   ```yaml
   # config.yaml
   storage:
     database_url: postgresql://atlas:atlas@localhost:5433/atlas
   ```
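The same connection string can be probed from Python; a minimal sketch assuming `psycopg` (v3) is installed:

```python
import psycopg  # assumes `pip install psycopg[binary]`

# Probe the connection string from the config above
with psycopg.connect("postgresql://atlas:atlas@localhost:5433/atlas") as conn:
    print(conn.execute("SELECT version()").fetchone())
```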
#### Port Conflicts

**Problem:** Port 5433 is already in use

**Solutions:**

```bash
# Find the process using the port
lsof -i :5433

# Kill the process if needed
kill -9 <PID>
```

```yaml
# Or use a different port in config
storage:
  database_url: postgresql://atlas:atlas@localhost:5434/atlas
```
#### Running Without Storage

**Problem:** Want to run the SDK without Postgres

**Solution:**

Storage is optional. The SDK will run without persistent storage, but rewards and learning history won’t be saved:

```yaml
# config.yaml - omit the storage section entirely
agent:
  type: litellm
  # ... rest of config

# Storage section is optional
# storage:
#   database_url: postgresql://...
```

Sessions still save to `.atlas/runs/` as JSON files without database persistence; a sketch for inspecting them follows.
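A minimal sketch for listing those local run files (the JSON schema is not documented here, so the top-level keys are printed rather than assumed):

```python
import json
import pathlib

# List locally persisted sessions; .atlas/runs/ is the path named above
for run_file in sorted(pathlib.Path(".atlas/runs").glob("*.json")):
    data = json.loads(run_file.read_text())
    keys = list(data) if isinstance(data, dict) else f"{len(data)} records"
    print(run_file.name, keys)
```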
### Discovery Issues

#### `atlas env init` Finds Nothing

**Problem:** No agent classes detected

**Solutions:**

```bash
# Discovery looks for:
# - LangChain agents (from langchain import *)
# - LangGraph graphs (@graph decorator)
# - Custom agent classes

# Ensure your code imports these libraries
grep -r "from langchain" .
grep -r "from langgraph" .
```

Or point the config at your agent explicitly:

```yaml
# config.yaml
agent:
  type: python
  import_path: your_module.agents
  attribute: create_agent

# Or for LangGraph:
# type: langgraph
# import_path: your_module.graph
# attribute: workflow
```

Check the virtual environment:

```bash
# Ensure the correct environment is active
which python
pip list | grep langchain

# Discovery runs in your current environment
atlas env init --verbose
```
#### Wrong Class Detected

**Problem:** Discovery picks the wrong agent

**Solution:**

Override auto-discovery with explicit config:

```yaml
# config.yaml
agent:
  type: python
  name: my-specific-agent
  import_path: my_package.agents
  attribute: production_agent  # specific function/class name
  system_prompt: |
    Custom prompt for this agent
```
#### Factory Synthesis Failures

**Problem:** Generated factory code fails

**Solutions:**

```bash
# Check the generated factory
cat .atlas/generated_factories.py

# Validate that it loads (run it as a script; a dot-prefixed
# directory like .atlas is not importable with `python -c`)
python .atlas/generated_factories.py

# Regenerate if needed
rm -rf .atlas/
atlas env init

# Or skip auto-discovery and use explicit config
```
### Runtime Errors

#### LLM Provider Authentication

**Problem:** `401 Unauthorized` or `403 Forbidden`

**Solutions:**

```bash
# Verify the API key is valid
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"

# Check key format
echo $OPENAI_API_KEY | grep -E "^sk-"  # OpenAI
echo $GEMINI_API_KEY | grep -E "^AI"   # Gemini (often starts with AI)

# Regenerate the key if expired:
# - OpenAI: https://platform.openai.com/api-keys
# - Anthropic: https://console.anthropic.com/keys
# - Gemini: https://aistudio.google.com/apikey
```
#### Timeout Errors

**Problem:** Request timeout during execution

**Solutions:**

```yaml
# Increase timeouts in config
agent:
  llm:
    timeout_seconds: 180  # default is 60

teacher:
  llm:
    timeout_seconds: 180

# Or for specific steps
orchestration:
  step_timeout_seconds: 900  # 15 minutes
```
#### MCP Server Connection Issues

**Problem:** MCP tools not available or connection refused

**Diagnostic Steps:**

```bash
# Check the MCP server is running
ps aux | grep mcp

# Test the MCP endpoint
curl http://localhost:3000/health  # adjust port

# Check logs
tail -f ~/.mcp/logs/server.log
```

**Solutions:**

```python
# Ensure the MCP server is started before the agent
import subprocess

mcp_process = subprocess.Popen([
    "python", "-m", "your_mcp_server"
])
# Call mcp_process.terminate() when done
```

```bash
# Then run atlas
atlas run --config config.yaml --task "Your task"
```
### Continual Learning Support

For issues specific to offline training, reward synthesis, or the learning engine, refer to the training-specific sections above. The Atlas SDK handles runtime orchestration and data collection, while Atlas Core handles offline model training from collected traces.

## Getting Help

If these solutions don’t resolve your issue:

1. **Check existing issues**: GitHub Issues
2. **Join the community**: Discord Server
3. **File a bug report** with:
   - Error message and stack trace
   - System info: `python -m torch.utils.collect_env`
   - Minimal reproduction code
   - Configuration used

## Next Steps

- **FAQ**: Frequently asked questions
- **Community**: Get help from the community