Time: 24-48 hours (mostly unattended) • Active setup: 30-45 minutes • Difficulty: Advanced
This is the advanced path. Most users should start with our pre-trained models.
Already collecting runtime traces? Stream them straight from Postgres with the runtime_pg data preset (+override /data@_global_: runtime_pg db_url=...) or stick with exported JSONL files (see the Runtime Export Guide).
Need to customise Hydra configs? See the Training Configuration guide for directory structure and override patterns.

Who Should Train Custom Models

You need custom training if you have:
  • Proprietary knowledge not available in public models
  • Domain-specific tasks where generic teaching doesn’t work well
  • Regulatory requirements that prevent using pre-trained models
  • Extreme performance needs where every percentage point matters
What you’ll need:
  • 4-8 H100 or A100 GPUs (40GB+ VRAM each)
  • 2-3 days of training time
  • Basic PyTorch and distributed training knowledge
  • ~200GB disk space for checkpoints

Training Pipeline Overview


Runtime traces as the data source

Direct database access (SDK v0.1.13+) queries training sessions from PostgreSQL with reward-based filtering and selective data loading. This eliminates JSONL export intermediates and prevents schema drift:
from atlas.training_data import get_training_sessions

# Query high-quality sessions directly
sessions = get_training_sessions(
    db_url="postgresql://atlas:atlas@localhost:5433/atlas",
    min_reward=0.8,
    status_filters=["succeeded"],
    limit=10000
)
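A quick sanity check before training is to inspect the reward distribution of what came back. The snippet below continues from the query above; the "reward" field name is an illustrative assumption, so confirm the actual session schema in the Training Data Pipeline Guide:
# Continuing from the get_training_sessions(...) call above.
# NOTE: the "reward" key is illustrative; check the real session schema.
rewards = [s["reward"] for s in sessions]
print(f"Loaded {len(sessions)} sessions")
print(f"Reward range: {min(rewards):.2f}-{max(rewards):.2f}, mean {sum(rewards)/len(rewards):.2f}")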
JSONL export (alternative method) is still supported for backward compatibility:
arc-atlas --database-url postgresql://... --output traces.jsonl
Reference the provided Hydra config for dataset loading:
# configs/data/runtime_traces.yaml
make_dataset_fn:
  _target_: custom_data.runtime_trace_data.get_runtime_trace_dataset
  export_path: traces/aime-batch.jsonl  # Or use direct database access
  eval_split_ratio: 0.1
Each session includes the triage dossier, adaptive summary (lane, confidence, certification flag, probe evidence), plan/step traces, persona usage/updates, reward breakdowns, and validation labels. See the Training Data Pipeline Guide for the complete API reference and filtering options.

Training happens in three clear steps:
  • Step 1: SFT Warmup → Teach basic teaching patterns (4-6 hours)
  • Step 2: Launch vLLM Server → Set up fast inference (5 minutes)
  • Step 3: Run GRPO → Learn verifying-teacher behaviors for the dual-agent runtime through RL (24-36 hours)
Each step has a clear goal and success criteria.

Step 1: SFT Warmup

Goal: Establish foundational teaching capabilities

What It Does

Supervised fine-tuning teaches the model basic teaching patterns through demonstration. Think of it like a student teacher observing an expert before trying it themselves.

Configuration Snapshot

# configs/run/teacher_sft.yaml
defaults:
  - _self_
  - override /model: llama3_8b
  - override /data: arc_atlas_sft
  - override /trainer: sft

learning_rate: 2e-5
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
warmup_ratio: 0.1
output_dir: checkpoints/sft
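As a quick sanity check on these defaults, the effective global batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs. A minimal worked example, assuming the 4-GPU launch shown below:
# Effective global batch size for the SFT defaults above (assumes 4 GPUs)
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 4
print(per_device_train_batch_size * gradient_accumulation_steps * num_gpus)  # 64 sequences per optimizer step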

Run It

# Minimum (2 GPUs)
scripts/launch.sh 2 configs/run/teacher_sft.yaml \
  output_dir=checkpoints/sft

# Recommended (4 GPUs)
scripts/launch.sh 4 configs/run/teacher_sft.yaml \
  output_dir=checkpoints/sft

# Full production (8 GPUs)
scripts/launch.sh 8 configs/run/teacher_sft.yaml \
  output_dir=checkpoints/sft

# Memory-constrained with offloading
scripts/launch.sh 2 configs/run/teacher_sft.yaml \
  output_dir=checkpoints/sft \
  +offload

Success Looks Like

Metric | Target | What It Means
Training Loss | <1.5 | Model is learning patterns
Gradient Norm | <5.0 | Training is stable
Duration | 4-6 hours | On 8× H100 GPUs
Monitor progress:
# Watch training logs
tail -f checkpoints/sft/training.log

# Check GPU usage
nvidia-smi -l 1
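If your trainer also writes TensorBoard events under the SFT output_dir (this depends on your logging settings; skip it if not), you can watch the loss curve there as well:
# Optional: loss curves, if events are written under the SFT output_dir
tensorboard --logdir checkpoints/sft --port 6006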

Step 2: Launch vLLM Server

Goal: Fast inference for RL training

What It Does

The vLLM server provides high-throughput generation during reinforcement learning. It runs on separate GPUs from the training process for maximum efficiency.

Run It

Atlas Core ships a lightweight launcher so you can spin up the inference stack without third-party tooling:
CUDA_VISIBLE_DEVICES=0,1 \
python trainers/vllm_server.py \
  --model checkpoints/sft/final \
  --port 8765 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9
If you prefer a single command that starts the vLLM servers and the GRPO job together, stick with scripts/launch_with_server.sh <server_gpus> <training_gpus> configs/run/teacher_rcl.yaml .... That wrapper orchestrates trainers/vllm_server.py under the hood, waits for the health checks to pass, and then launches scripts/launch.sh for reinforcement learning.

Success Looks Like

# Server should respond
curl http://localhost:8765/v1/models

# Should return 200 with model info
Key parameters:
  • tensor-parallel-size: Number of GPUs for inference (match your hardware)
  • gpu-memory-utilization: How much VRAM to use (0.9 = 90%)
  • max-model-len: Maximum sequence length (2048 is a good default)
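Beyond the /v1/models health check, you can optionally send a tiny generation request. This sketch assumes the launcher exposes the standard OpenAI-compatible completions route on the same port; adjust the model name and port to your setup:
# Optional smoke test (assumes an OpenAI-compatible /v1/completions endpoint)
curl http://localhost:8765/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "checkpoints/sft/final", "prompt": "2 + 2 =", "max_tokens": 8}'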

Step 3: Run GRPO Training

Goal: Optimize teaching through reinforcement learning

What It Does

GRPO (Group Relative Policy Optimization) trains the teacher to actually improve student performance. The reward comes from measuring if students get better when taught.
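To make the "group relative" part concrete, here is a minimal conceptual sketch of the advantage computation GRPO is built around (an illustration of the principle, not the trainer's actual code): each prompt gets a group of sampled responses, and each response is scored relative to its group's mean reward.
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Conceptual sketch: rewards has shape (num_prompts, num_generations)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True).clamp_min(1e-6)
    # Responses better than their group's average get positive advantages.
    return (rewards - mean) / std

# Four sampled teaching attempts for one prompt, scored by the reward model:
rewards = torch.tensor([[0.20, 0.90, 0.50, 0.40]])
print(group_relative_advantages(rewards))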

Run It

# Minimum (2 GPUs: 1 training, 1 vLLM)
scripts/launch_with_server.sh 1 1 configs/run/teacher_rcl.yaml \
  model_name_or_path=checkpoints/sft/final

# Recommended (4 GPUs: 2 training, 2 vLLM)
scripts/launch_with_server.sh 2 2 configs/run/teacher_rcl.yaml \
  model_name_or_path=checkpoints/sft/final

# Production (8 GPUs: 4 training, 4 vLLM)
scripts/launch_with_server.sh 4 4 configs/run/teacher_rcl.yaml \
  model_name_or_path=checkpoints/sft/final
The two numbers split your GPUs between the vLLM server and training; per the <server_gpus> <training_gpus> signature above, the server GPU count comes first.

Success Looks Like

Metric | Healthy Range | Warning Sign
Reward Mean | Increasing | Plateau or decrease
Non-degradation Rate | >95% | <90% indicates issues
KL Divergence | 0.5-2.0 | >5.0 suggests collapse
Monitor with TensorBoard:
tensorboard --logdir checkpoints/grpo --port 6006

Key Configuration Parameters

You only need to understand a few core parameters:

beta (KL divergence coefficient)

What it does: Controls how much the model can change from the original
Default: 0.04
When to adjust:
  • Model changing too fast? → Increase to 0.06
  • Training too conservative? → Decrease to 0.02

temperature (sampling temperature)

What it does: Controls how creative the teaching is
Default: 0.7
When to adjust:
  • Want more diverse teaching? → Increase to 0.9
  • Teaching too random? → Decrease to 0.5

learning_rate

What it does: How fast the model learns
Default: 5e-7 (much smaller than SFT!)
When to adjust:
  • Training too slow? → Try 1e-6
  • Rewards collapsing? → Decrease to 1e-7
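These knobs can typically be set as key=value overrides on the launch command, just like model_name_or_path above. A sketch (confirm the exact key names in configs/run/teacher_rcl.yaml):
# Sketch: adjust GRPO knobs without editing the YAML (verify key names first)
scripts/launch_with_server.sh 2 2 configs/run/teacher_rcl.yaml \
  model_name_or_path=checkpoints/sft/final \
  beta=0.06 \
  learning_rate=5e-7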
GRPO uses the principle-based reward system. Point the trainer at the reward config:
# configs/trainer/reward/rim_teaching.yaml
teacher_reward:
  _target_: RIM.reward_adapter.RIMReward
  config_path: configs/rim_offline_config.yaml
The offline config focuses on helpfulness and process (not accuracy):
# configs/rim_offline_config.yaml
active_judges:
  accuracy: false      # Disable for offline training
  helpfulness: true    # Core reward signal
  process: true        # Rewards good reasoning
  diagnostic: false    # Disable for offline training
Monitor rim_rewards in logs to spot regressions.
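For example, a quick way to tail that signal from the command line (the log path assumes output_dir=checkpoints/grpo; adjust to your run):
# Watch the RIM reward signal during GRPO
grep "rim_rewards" checkpoints/grpo/training.log | tail -n 20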
# Full configuration options
num_generations: 32              # Samples per prompt
max_new_tokens: 512             # Response length
top_p: 0.95                     # Nucleus sampling
warmup_ratio: 0.1               # LR warmup
weight_decay: 0.01              # L2 regularization
max_grad_norm: 1.0              # Gradient clipping
gradient_accumulation_steps: 4   # Effective batch size

Troubleshooting

Problem: GPU runs out of memory during training
Quick fixes:
# Reduce batch size
per_device_train_batch_size=1

# Enable gradient checkpointing
gradient_checkpointing=true

# Use DeepSpeed offloading
deepspeed=configs/deepspeed/zero2.json
Problem: Rewards go to zero or negative
Quick fixes:
# Increase KL penalty
beta=0.1

# Reduce learning rate
learning_rate=1e-7

# Check data quality
# Verify your training data has clear improvement signals
Problem: Training can't connect to the vLLM server
Quick fixes:
# Check server is running
ps aux | grep vllm

# Verify the port is open (use the port your server was launched on, e.g. 8765)
lsof -i :8765

# Restart with more memory
--gpu-memory-utilization 0.95
Problem: Training slower than expected
Quick fixes:
# Enable Flash Attention 2
attn_implementation=flash_attention_2

# Use torch compile (PyTorch 2.0+)
torch_compile=true

# Optimize data loading
dataloader_num_workers=4

Expected Results

After successful training, you should see:

Performance Metrics

  • Teaching efficiency: 15.7% average accuracy improvement
  • Safety: 97% non-degradation rate
  • Token efficiency: 50% reduction in response length
  • Completion rate: 31 percentage-point improvement (69% → 100%)

Training Duration

  • 2 GPUs: 4-5 days
  • 4 GPUs: 2-3 days
  • 8 H100s: 24-36 hours

Output Artifacts

results/
├── sft_checkpoint/          # Phase 1 model
│   ├── pytorch_model.bin
│   └── config.json
├── rl_checkpoint/           # Phase 2 model (use this!)
│   ├── pytorch_model.bin
│   ├── config.json
│   └── trainer_state.json
└── logs/
    ├── train.log
    └── tensorboard_events

Validation

Test your trained model:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load your trained teacher
teacher = AutoModelForCausalLM.from_pretrained(
    "results/rl_checkpoint",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("results/rl_checkpoint")

# Load baseline student
student = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Test on a problem
problem = "A train travels 120 miles in 2 hours. What is its speed?"

# Get baseline (student only)
inputs = tokenizer(problem, return_tensors="pt").to(student.device)
baseline = student.generate(**inputs, max_new_tokens=100)
print(f"Baseline: {tokenizer.decode(baseline[0])}")

# Get teaching (using the atlas-sdk runtime loop)
# This gives you the enhanced response

Performance Optimization

Multi-Node Training

Scale across multiple machines:
# Node 1 (master)
torchrun \
  --nproc_per_node=8 \
  --nnodes=2 \
  --node_rank=0 \
  --master_addr=10.0.0.1 \
  train.py configs/run/teacher_rcl.yaml

# Node 2
torchrun \
  --nproc_per_node=8 \
  --nnodes=2 \
  --node_rank=1 \
  --master_addr=10.0.0.1 \
  train.py configs/run/teacher_rcl.yaml

DeepSpeed for Large Models

// configs/deepspeed/zero2.json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {"device": "cpu"},
    "overlap_comm": true
  },
  "fp16": {"enabled": true},
  "gradient_clipping": 1.0
}
Use with:
scripts/launch.sh 8 configs/run/teacher_rcl.yaml \
  deepspeed=configs/deepspeed/zero2.json

Next Steps

References