Common Issues

This guide covers frequent problems and their solutions when working with ATLAS.

Installation Issues

CUDA Not Available

Problem: torch.cuda.is_available() returns False

Solutions:

1. Verify the CUDA installation:

nvidia-smi
nvcc --version

If these commands are not found, install CUDA 11.8+ from NVIDIA.

2. Reinstall PyTorch with CUDA support:

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

3. Check GPU compatibility. Ensure the GPU compute capability is ≥7.0:

import torch
torch.cuda.get_device_capability()
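
To confirm the fix end to end, a quick check (a minimal sketch) verifies that PyTorch sees the GPU and reports the CUDA version it was built against:

import torch

# Should print True, the CUDA version of the installed wheel (e.g. 11.8),
# and the detected GPU name.
print(torch.cuda.is_available())
print(torch.version.cuda)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))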

Flash Attention Build Failure

Problem: pip install flash-attn fails with compilation errors

Solutions:
  • Ensure the CUDA toolkit version matches the one PyTorch was built against
  • Install without build isolation, as recommended by the flash-attn maintainers:
    pip install flash-attn --no-build-isolation

  • Skip Flash Attention at a performance cost (see the sketch after this list):
    attn_implementation="eager"  # Instead of "flash_attention_2"
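
To check the toolkit match, and to fall back to standard attention when flash-attn is unavailable, something like the following works (a sketch; the model name is illustrative):

import torch
from transformers import AutoModelForCausalLM

# Compare this value with the toolkit version reported by `nvcc --version`;
# a mismatch is a common cause of flash-attn build failures.
print(torch.version.cuda)

# Fallback: load the model with eager attention instead of flash_attention_2.
model = AutoModelForCausalLM.from_pretrained(
    "Arc-Intelligence/ATLAS-8B-Thinking",  # illustrative; use your model
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
)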
    

Hugging Face Access Issues

Problem: Can’t download models from Hugging Face

Solutions:
# Login to Hugging Face
huggingface-cli login

# Set cache directory if disk space limited
export HF_HOME=/path/to/cache

# Use offline mode if downloaded
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
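
Models can also be pre-downloaded from Python before switching to offline mode (a sketch using huggingface_hub; the repo id and cache path are illustrative):

from huggingface_hub import snapshot_download

# Download the full repository into the cache so that
# TRANSFORMERS_OFFLINE=1 can be set afterwards.
snapshot_download(
    repo_id="Arc-Intelligence/ATLAS-8B-Thinking",  # illustrative
    cache_dir="/path/to/cache",                    # optional; defaults to HF_HOME
)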

Memory Issues

CUDA Out of Memory

Problem: RuntimeError: CUDA out of memory

Progressive Solutions:
# 1. Reduce batch size and compensate with gradient accumulation
config.per_device_train_batch_size = 1
config.gradient_accumulation_steps = 32  # Maintain effective batch size

# 2. Gradient checkpointing (trades compute for memory)
config.gradient_checkpointing = True

# 3. Mixed precision
config.fp16 = True  # or bf16 = True for A100/H100

# 4. Quantization: 8-bit
config.load_in_8bit = True

# Or 4-bit
config.load_in_4bit = True
config.bnb_4bit_compute_dtype = torch.float16

# 5. CPU offloading
config.offload = True
# Or with DeepSpeed ZeRO-3 offload
config.deepspeed = "configs/deepspeed/zero3_offload.json"
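
If you load the model directly with transformers instead of through the ATLAS config, the equivalent quantized load looks roughly like this (a sketch; assumes bitsandbytes is installed, and the model name is illustrative):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization with FP16 compute, mirroring the config flags above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Arc-Intelligence/ATLAS-8B-Thinking",  # illustrative
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()  # trades compute for memory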

Memory Calculation

Estimate memory requirements:
def estimate_memory_gb(model_params_b, batch_size, seq_length):
    """Rough memory estimate for training (model size in billions of parameters)."""
    # Model weights: 2 bytes per parameter in FP16, so params_in_billions * 2 GB
    weights_gb = model_params_b * 2

    # Activations (very rough; assumes a hidden size of 8192 and 4 bytes per value)
    activations_gb = (batch_size * seq_length * 8192 * 4) / 1e9

    # Gradients and optimizer states (Adam keeps two FP32 moments per parameter)
    optimizer_gb = weights_gb * 4

    total_gb = weights_gb + activations_gb + optimizer_gb
    return total_gb

# Example: 8B model
memory_needed = estimate_memory_gb(8, batch_size=4, seq_length=2048)
print(f"Estimated memory: {memory_needed:.1f} GB")
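
To compare the estimate with what a run actually uses, PyTorch's memory counters are enough (a minimal sketch):

import torch

# Peak GPU memory used by tensors, and peak memory reserved by the allocator,
# since the process started (or since reset_peak_memory_stats() was last called)
print(f"Peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
print(f"Peak reserved:  {torch.cuda.max_memory_reserved() / 1e9:.1f} GB")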

Training Issues

Loss Not Decreasing

Problem: Training loss plateaus or increases

Diagnostic Steps:
# Check learning rate
print(f"Current LR: {trainer.optimizer.param_groups[0]['lr']}")

# Verify data loading
sample = next(iter(train_dataloader))
print(f"Input shape: {sample['input_ids'].shape}")
print(f"Labels present: {'labels' in sample}")

# Check gradient flow
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad_norm={param.grad.norm().item():.4f}")
Solutions:
  • Reduce learning rate: learning_rate=1e-6
  • Increase warmup: warmup_ratio=0.2
  • Check for data issues: duplicates, incorrect labels
  • Verify the loss function: ensure labels are properly masked (a quick check for the last two points follows this list)
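
A batch-level sanity check (a sketch; assumes the usual convention that label positions set to -100 are ignored by the loss):

# Inspect one batch for duplicate sequences and label masking
batch = next(iter(train_dataloader))
input_ids, labels = batch["input_ids"], batch["labels"]

# Duplicate sequences within the batch
unique_rows = len({tuple(row.tolist()) for row in input_ids})
print(f"Unique sequences in batch: {unique_rows}/{input_ids.shape[0]}")

# Fraction of label positions masked out with -100
masked = (labels == -100).float().mean().item()
print(f"Masked label fraction: {masked:.2%}")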

NaN or Inf Loss

Problem: Loss becomes NaN or Inf

Solutions:
# Clip gradients more aggressively
config.max_grad_norm = 0.5

# Reduce learning rate
config.learning_rate = 1e-7

# Check for numerical instability
config.fp32 = True  # Use full precision temporarily

# Add gradient debugging
def check_gradients(model):
    for name, param in model.named_parameters():
        if param.grad is not None:
            if torch.isnan(param.grad).any():
                print(f"NaN gradient in {name}")
            if torch.isinf(param.grad).any():
                print(f"Inf gradient in {name}")
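
In a manual training loop, call check_gradients(model) after loss.backward() and before the optimizer step, so problematic gradients are reported before they are applied.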

Slow Training Speed

Problem: Training is slower than expected

Performance Optimizations:
# Enable torch.compile (PyTorch 2.0+)
model = torch.compile(model)

# Use Flash Attention
config.attn_implementation = "flash_attention_2"

# Optimize data loading
config.dataloader_num_workers = 4
config.dataloader_pin_memory = True

# Enable TF32 on Ampere GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Profile to find bottlenecks
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    trainer.train()
print(prof.key_averages().table(sort_by="cuda_time_total"))

Inference Issues

Slow Inference

Problem: Generation is too slow for production

Solutions:
# Use vLLM for faster inference
from vllm import LLM, SamplingParams

llm = LLM(model="Arc-Intelligence/ATLAS-8B-Thinking")
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512
)

# Batch processing
outputs = llm.generate(prompts, sampling_params)

# Or use torch.compile
model = torch.compile(model, mode="reduce-overhead")

# Enable KV cache
model.config.use_cache = True

Inconsistent Results

Problem: Different results on each run

Solutions:
# Set seeds for reproducibility
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)

# Use deterministic algorithms (on CUDA this may also require setting
# CUBLAS_WORKSPACE_CONFIG=:4096:8 in the environment)
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False

# Disable sampling for deterministic (greedy) generation
generation_kwargs = {
    "temperature": 0.0,
    "do_sample": False
}

vLLM Server Issues

Server Won’t Start

Problem: vLLM server fails to launch

Diagnostic Commands:
# Check if port is in use
lsof -i :8000

# Test with smaller model
python -m vllm.entrypoints.openai.api_server \
  --model facebook/opt-125m \
  --port 8001

# Check GPU memory
nvidia-smi

# Verify vLLM installation
python -c "import vllm; print(vllm.__version__)"
Solutions:
  • Reduce GPU memory usage: --gpu-memory-utilization 0.8
  • Lower the maximum context length: --max-model-len 1024
  • Enable prefix caching: --enable-prefix-caching
  • Try a different port (the same options via the Python API are shown after this list)
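
The corresponding settings in the Python API (a sketch with illustrative values):

from vllm import LLM

# Mirrors the CLI flags above: cap GPU memory use, shorten the context window,
# and enable prefix caching
llm = LLM(
    model="Arc-Intelligence/ATLAS-8B-Thinking",  # illustrative
    gpu_memory_utilization=0.8,
    max_model_len=1024,
    enable_prefix_caching=True,
)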

Connection Refused

Problem: Can’t connect to vLLM server

Solutions:
# Check server is running
ps aux | grep vllm

# Test connection
curl http://localhost:8000/v1/models

# Check firewall
sudo ufw status

# Use correct URL in code
vllm_url = "http://localhost:8000/v1"  # Not https
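
If curl works but application code still fails, the client is usually pointed at the wrong base URL. The OpenAI-compatible client setup looks like this (a sketch; assumes the openai Python package and the default port):

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the api_key is required by the client
# but not validated by a default vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
print([m.id for m in client.models.list().data])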

Online Optimization Issues

GEPA Not Converging

Problem: Online optimization shows no improvement

Solutions:
# Increase population size
config.population_size = 30

# Adjust mutation temperature
config.mutation_temperature = 0.8

# Use better reflection model
config.reflection_model = "gpt-4-turbo"

# Add more diverse examples
config.num_samples = 200

# Check data quality
validator.check_sample_diversity(samples)

High API Costs

Problem: Online optimization is exceeding the budget

Solutions:
# Reduce iterations
max_iterations: 50  # Instead of 100

# Use cheaper models
reflection_model: "gpt-3.5-turbo"
evaluation_model: "gpt-3.5-turbo"

# Enable caching
enable_cache: true
cache_ttl: 3600

# Smaller batches
batch_size: 2

Getting Help

If these solutions don’t resolve your issue:
  1. Check existing issues: GitHub Issues
  2. Join community: Discord Server
  3. File a bug report with:
    • Error message and stack trace
    • System info: python -m torch.utils.collect_env
    • Minimal reproduction code
    • Configuration used

Next Steps