Reading time: 15 minutes • Implementation time: 2-3 days • Difficulty: Advanced

Overview

GRPO (Group Relative Policy Optimization) is the core algorithm for training ATLAS teacher models. This guide walks through the complete training pipeline, from SFT warmup to full RL optimization.

Prerequisites

  • 4-8 H100 or A100 GPUs (40GB+ VRAM each)
  • CUDA 11.8+
  • Python 3.8+
  • 100GB+ disk space for checkpoints
  • Weights & Biases account (optional but recommended)
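Once PyTorch is installed (Step 1), a short check like the sketch below can confirm the hardware meets these requirements (a minimal Python sanity check; adjust the thresholds to your own setup):

import shutil
import torch

# Minimal environment check (illustrative; adjust thresholds to your setup)
assert torch.cuda.is_available(), "CUDA is not available"
print(f"CUDA version: {torch.version.cuda}")
print(f"GPUs detected: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    vram_gb = torch.cuda.get_device_properties(i).total_memory / 1e9
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)} ({vram_gb:.0f} GB)")

free_gb = shutil.disk_usage('.').free / 1e9
print(f"Free disk space: {free_gb:.0f} GB (100+ GB recommended for checkpoints)")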

Training Pipeline

1. Environment Setup

Install dependencies and configure the environment:
# Clone repository
git clone https://github.com/Arc-Computer/ATLAS.git
cd ATLAS

# Create environment
conda create -n atlas python=3.10
conda activate atlas

# Install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install training dependencies
pip install -r requirements.txt

# Install Flash Attention 2 (recommended)
pip install flash-attn --no-build-isolation

# Login to Hugging Face
huggingface-cli login

# Configure Weights & Biases (optional)
wandb login
Flash Attention 2 significantly improves training speed and memory efficiency. Install it if your GPU supports it (Ampere or newer).
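To confirm the GPU meets the compute-capability requirement (8.0+, i.e. Ampere or newer) and that flash-attn imports cleanly, a quick check like this sketch works:

import torch

# Flash Attention 2 needs compute capability >= 8.0 (Ampere or newer)
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
if major >= 8:
    try:
        import flash_attn
        print("flash-attn is installed")
    except ImportError:
        print("flash-attn missing; run: pip install flash-attn --no-build-isolation")
else:
    print("GPU predates Ampere; skip Flash Attention 2")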
2. SFT Warmup

Start with supervised fine-tuning to establish base capabilities:
# Run SFT training
scripts/launch.sh 8 configs/run/teacher_sft.yaml \
  output_dir=checkpoints/sft \
  num_train_epochs=1 \
  learning_rate=2e-5 \
  per_device_train_batch_size=4 \
  gradient_accumulation_steps=4
Key parameters:
  • num_train_epochs: 1-2 epochs typically sufficient
  • learning_rate: Start with 2e-5, adjust based on loss
  • gradient_accumulation_steps: Increase for a larger effective batch size (e.g., 4 per device × 8 GPUs × 4 accumulation steps = 128)
Monitor training:
# Check training progress
tail -f checkpoints/sft/training.log

# Monitor GPU usage
nvidia-smi -l 1
SFT is critical for GRPO success. Ensure loss converges before proceeding.
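A quick way to confirm convergence is to inspect the logged losses in the SFT output directory. The sketch below assumes the standard Hugging Face trainer_state.json layout (the file may instead live inside the latest checkpoint-* subdirectory):

import json

# Print the last few logged SFT losses to confirm the curve has flattened
with open('checkpoints/sft/trainer_state.json') as f:
    state = json.load(f)

losses = [entry['loss'] for entry in state['log_history'] if 'loss' in entry]
print("Last 5 logged losses:", [round(loss, 4) for loss in losses[-5:]])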
3. Launch vLLM Server

Start the inference server for distributed generation:
# Terminal 1: Launch vLLM server
./scripts/launch_vllm_server.sh \
  --model-path checkpoints/sft/final \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 2048 \
  --port 8000

# Verify server is running
curl http://localhost:8000/v1/models
Server configuration options:
# configs/vllm_config.yaml
tensor_parallel_size: 4     # GPUs for inference
pipeline_parallel_size: 1    # Pipeline parallelism
gpu_memory_utilization: 0.9  # Memory allocation
max_model_len: 2048          # Max sequence length
enable_prefix_caching: true  # Cache repeated prompts
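Beyond the /v1/models check, you can send a small completion request through the OpenAI-compatible endpoint to confirm generation works end to end (a sketch; use the model name that /v1/models reports):

import requests

# Minimal end-to-end generation check against the vLLM server
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "checkpoints/sft/final",  # must match the name reported by /v1/models
        "prompt": "Sanity check: 2 + 2 =",
        "max_tokens": 8,
        "temperature": 0.0,
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])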
4. Run GRPO Training

Execute the main RL training with the SFT checkpoint:
# Terminal 2: Run GRPO training
scripts/launch_with_server.sh 4 4 configs/run/teacher_rcl.yaml \
  model_name_or_path=checkpoints/sft/final \
  output_dir=checkpoints/grpo \
  num_train_epochs=3 \
  learning_rate=5e-7 \
  beta=0.04 \
  temperature=0.7
The script automatically:
  1. Distributes training across 4 GPUs
  2. Uses 4 GPUs for vLLM generation
  3. Manages distributed communication
  4. Handles checkpointing
The first number (4) is training GPUs, second (4) is inference GPUs. Adjust based on your hardware.
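A quick sanity check that the requested split fits the visible hardware (a minimal sketch):

import torch

# The command above assumes 4 training GPUs plus 4 inference GPUs
train_gpus, inference_gpus = 4, 4
available = torch.cuda.device_count()
assert train_gpus + inference_gpus <= available, (
    f"Requested {train_gpus + inference_gpus} GPUs but only {available} are visible"
)
print(f"OK: {train_gpus} training + {inference_gpus} inference GPUs on {available} available")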
5. Monitor Training

Track key metrics during training:
# Monitor with Weights & Biases
wandb sync checkpoints/grpo

# Or use TensorBoard
tensorboard --logdir checkpoints/grpo/tensorboard

# Key metrics to watch:
# - rewards/mean_reward: Should increase
# - kl_divergence: Should stay < 10
# - learning_rate: Verify schedule
# - loss/policy_loss: Should decrease
Real-time monitoring script:
import json
import time

def monitor_training(state_path='checkpoints/grpo/trainer_state.json'):
    """Poll the trainer state file and print the latest metrics."""
    while True:
        with open(state_path) as f:
            state = json.load(f)

        # The most recent metrics are appended to the log_history list
        latest = next(
            (entry for entry in reversed(state.get('log_history', [])) if 'loss' in entry),
            {}
        )

        print(f"Step: {state.get('global_step')}")
        print(f"Loss: {latest.get('loss', float('nan')):.4f}")
        print(f"Learning Rate: {latest.get('learning_rate', float('nan')):.2e}")
        if state.get('best_metric') is not None:
            print(f"Best Metric: {state['best_metric']:.4f}")
        print("-" * 40)

        time.sleep(10)

monitor_training()
6. Evaluate Model

Test the trained model on validation data:
# Run evaluation
python scripts/evaluate_model.py \
  --model-path checkpoints/grpo/best_model \
  --dataset Arc-Intelligence/Arc-ATLAS-Teach-v0 \
  --split validation \
  --metrics accuracy improvement safety

# Test specific capabilities
python scripts/test_teaching.py \
  --teacher checkpoints/grpo/best_model \
  --student Qwen/Qwen3-4B-Instruct \
  --tasks sre_debugging math_reasoning code_generation
Expected metrics:
  • Improvement rate: >15%
  • Non-degradation: >95%
  • Token efficiency: <250 tokens average
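If your evaluation run writes its metrics to a JSON file, a small gate like the following can enforce these thresholds automatically. The file path and key names below are assumptions for illustration; adapt them to the evaluator's actual output:

import json

# Hypothetical results file and keys; adapt to the real evaluator output
with open('checkpoints/grpo/eval_results.json') as f:
    metrics = json.load(f)

assert metrics['improvement_rate'] > 0.15, "Improvement rate below 15%"
assert metrics['non_degradation_rate'] > 0.95, "Non-degradation rate below 95%"
assert metrics['avg_guidance_tokens'] < 250, "Average guidance exceeds 250 tokens"
print("All target thresholds met")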

Configuration Deep Dive

GRPO Hyperparameters

Critical parameters for successful training:
# configs/trainer/teacher_grpo.yaml
# Algorithm parameters
beta: 0.04                    # KL divergence coefficient (0.01-0.1)
temperature: 0.7              # Sampling temperature (0.5-1.0)
grpo_alpha: 0.5              # PPO-style clipping (0.1-1.0)

# Generation settings
max_new_tokens: 512          # Response length limit
top_p: 0.95                  # Nucleus sampling
do_sample: true              # Enable sampling

# Optimization
learning_rate: 5e-7          # Peak learning rate
warmup_ratio: 0.1           # Warmup proportion
weight_decay: 0.01          # L2 regularization
max_grad_norm: 1.0          # Gradient clipping

# Efficiency
gradient_accumulation_steps: 4  # Effective batch size multiplier
gradient_checkpointing: true    # Memory vs compute tradeoff
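For intuition on what these knobs control: beta scales the KL penalty that keeps the policy close to the SFT reference, and GRPO computes advantages relative to the group of responses sampled for each prompt rather than from a learned value function. A minimal sketch of that group-relative advantage computation (illustrative, not the trainer's exact code):

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize rewards within each group of sampled responses.

    rewards has shape (num_prompts, group_size): one reward per sampled response.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts with 4 sampled responses each
print(group_relative_advantages(torch.tensor([[1.0, 0.0, 0.5, 0.2],
                                              [0.9, 0.9, 0.1, 0.3]])))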

Reward Function Configuration

Customize rewards for your use case:
# configs/reward/adaptive_teaching.yaml
# Core reward components
degradation_penalty_multiplier: 2.0  # Penalty for performance drops
efficiency_weight: 1.0              # Reward for concise teaching
baseline_threshold: 0.5             # Minimum performance for rewards

# Advanced settings
diversity_bonus: 0.1                # Encourage varied strategies
consistency_weight: 0.2             # Reward stable performance
max_probe_tokens: 50               # Diagnostic limit
max_guidance_tokens: 200           # Teaching limit
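To see how these settings fit together, here is an illustrative reward-shaping function wired to the same parameter names. It is a sketch of the intent (reward improvement, penalize degradation more heavily, favor concise guidance), not the repository's actual implementation:

def adaptive_teaching_reward(score_with_teaching: float,
                             score_baseline: float,
                             guidance_tokens: int,
                             cfg: dict) -> float:
    """Illustrative reward: pay for improvement, punish degradation, favor brevity."""
    delta = score_with_teaching - score_baseline

    if delta < 0:
        # Degradation is penalized more heavily than improvement is rewarded
        return cfg['degradation_penalty_multiplier'] * delta

    if score_with_teaching < cfg['baseline_threshold']:
        return 0.0  # below the minimum performance bar, no positive reward

    # Bonus for staying under the guidance token budget
    efficiency = max(0.0, 1.0 - guidance_tokens / cfg['max_guidance_tokens'])
    return delta + cfg['efficiency_weight'] * efficiency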

Advanced Training Techniques

Curriculum Learning

Implement progressive difficulty:
class CurriculumScheduler:
    """
    Adjust task difficulty during training
    """

    def __init__(self, num_epochs):
        self.num_epochs = num_epochs
        self.current_epoch = 0

    def step(self):
        """Advance to the next epoch; call once per completed epoch."""
        self.current_epoch = min(self.current_epoch + 1, self.num_epochs)

    def get_task_distribution(self):
        """
        Return task mixture for current epoch
        """
        progress = self.current_epoch / self.num_epochs

        if progress < 0.3:
            # Start with simple tasks
            return {
                'simple': 0.6,
                'medium': 0.3,
                'hard': 0.1
            }
        elif progress < 0.7:
            # Balanced distribution
            return {
                'simple': 0.33,
                'medium': 0.34,
                'hard': 0.33
            }
        else:
            # Focus on hard tasks
            return {
                'simple': 0.1,
                'medium': 0.3,
                'hard': 0.6
            }
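Usage sketch: draw a difficulty bucket from the current distribution when sampling tasks, and advance the scheduler once per epoch:

import random

scheduler = CurriculumScheduler(num_epochs=3)
for epoch in range(3):
    dist = scheduler.get_task_distribution()
    bucket = random.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]
    print(f"Epoch {epoch}: distribution={dist}, sampled bucket={bucket}")
    scheduler.step()  # advance to the next epoch's mixture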

Mixed Precision Training

Enable for faster training:
# Add to training config
fp16: true                    # Use float16
fp16_opt_level: "O2"         # Optimization level
fp16_backend: "amp"          # Use automatic mixed precision

# Or use bfloat16 (recommended for A100/H100)
bf16: true
bf16_full_eval: true
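You can check bfloat16 support before enabling it; Ampere and newer GPUs report True:

import torch

# Prefer bf16 on A100/H100; fall back to fp16 elsewhere
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
print("bf16: true" if use_bf16 else "fp16: true")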

Gradient Accumulation Strategy

Optimize for your GPU memory:
def calculate_accumulation_steps(
    desired_batch_size=128,
    per_device_batch_size=4,
    num_gpus=8
):
    """
    Calculate gradient accumulation steps
    """
    effective_batch_per_step = per_device_batch_size * num_gpus
    accumulation_steps = max(1, desired_batch_size // effective_batch_per_step)
    effective_batch_size = accumulation_steps * effective_batch_per_step

    print(f"Per-device batch size: {per_device_batch_size}")
    print(f"Number of GPUs: {num_gpus}")
    print(f"Gradient accumulation steps: {accumulation_steps}")
    print(f"Effective batch size: {effective_batch_size}")

    return accumulation_steps
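For example, if memory pressure forces the per-device batch size down to 1 on 8 GPUs, the helper reports the accumulation needed to keep the effective batch at 128:

# 1 sample per device x 8 GPUs -> 16 accumulation steps for an effective batch of 128
steps = calculate_accumulation_steps(desired_batch_size=128,
                                     per_device_batch_size=1,
                                     num_gpus=8)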

Troubleshooting

Problem: OOM errors during training
Solutions:
# Reduce batch size
per_device_train_batch_size=1

# Enable gradient checkpointing
gradient_checkpointing=true

# Use DeepSpeed ZeRO
deepspeed=configs/deepspeed/zero2.json

# Offload to CPU
offload=true
Problem: Rewards go to zero or negative
Solutions:
# Increase KL penalty
beta=0.1

# Reduce learning rate
learning_rate=1e-7

# Add reward clipping
reward_clip_threshold=5.0

# Check data quality
validate_training_data()
Problem: Connection refused or timeout
Solutions:
# Check server is running
ps aux | grep vllm

# Check port availability
lsof -i :8000

# Restart server with more memory
--gpu-memory-utilization 0.95

# Reduce the maximum sequence length
--max-model-len 1024
Problem: Training is slower than expected
Solutions:
# Enable Flash Attention
attn_implementation=flash_attention_2

# Use torch.compile (PyTorch 2.0+)
torch_compile=true

# Optimize data loading
dataloader_num_workers=4
dataloader_pin_memory=true

# Profile a short run to find bottlenecks
python -m torch.utils.bottleneck train.py configs/run/teacher_rcl.yaml

Performance Optimization

Multi-Node Training

Scale across multiple machines:
# Node 1 (master)
torchrun \
  --nproc_per_node=8 \
  --nnodes=2 \
  --node_rank=0 \
  --master_addr=10.0.0.1 \
  --master_port=29500 \
  train.py configs/run/teacher_rcl.yaml

# Node 2
torchrun \
  --nproc_per_node=8 \
  --nnodes=2 \
  --node_rank=1 \
  --master_addr=10.0.0.1 \
  --master_port=29500 \
  train.py configs/run/teacher_rcl.yaml

DeepSpeed Integration

Use ZeRO optimization for large models:
// configs/deepspeed/zero2.json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000
  },
  "gradient_clipping": 1.0,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}

Validation and Testing

Automated Evaluation Suite

def evaluate_teacher_model(model_path, test_suite):
    """
    Comprehensive evaluation of trained model
    """
    results = {}

    # Test per-domain improvement; evaluate_domain is assumed to come from the project's eval utilities
    for domain in ['sre', 'math', 'code', 'reasoning']:
        domain_results = evaluate_domain(model_path, domain)
        results[domain] = {
            'improvement': domain_results['mean_improvement'],
            'safety': domain_results['non_degradation_rate'],
            'efficiency': domain_results['avg_tokens']
        }

    # Verify safety constraints
    assert all(r['safety'] > 0.95 for r in results.values()), \
           "Safety threshold not met"

    # Check minimum performance
    assert all(r['improvement'] > 0.10 for r in results.values()), \
           "Improvement threshold not met"

    return results

A/B Testing Framework

import numpy as np
from scipy.stats import ttest_rel

class ModelComparison:
    """
    Compare new model against baseline
    """

    def __init__(self, baseline_path, new_model_path):
        # load_model is assumed to come from your evaluation utilities
        self.baseline = load_model(baseline_path)
        self.new_model = load_model(new_model_path)

    def run_comparison(self, test_data, num_samples=100):
        results = {
            'baseline': [],
            'new_model': []
        }

        for sample in test_data[:num_samples]:
            # Test both models
            baseline_result = self.baseline.process(sample)
            new_result = self.new_model.process(sample)

            results['baseline'].append(baseline_result['score'])
            results['new_model'].append(new_result['score'])

        # Paired t-test for statistical significance
        t_stat, p_value = ttest_rel(
            results['new_model'],
            results['baseline']
        )

        return {
            'baseline_mean': np.mean(results['baseline']),
            'new_model_mean': np.mean(results['new_model']),
            'improvement': np.mean(results['new_model']) - np.mean(results['baseline']),
            'p_value': p_value,
            'significant': p_value < 0.05
        }
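Usage sketch (the checkpoint paths are from the steps above; load_validation_samples is a placeholder for your own data loader):

comparison = ModelComparison(
    baseline_path='checkpoints/sft/final',
    new_model_path='checkpoints/grpo/best_model',
)
# load_validation_samples() is a placeholder for however you load evaluation samples
report = comparison.run_comparison(test_data=load_validation_samples(), num_samples=100)
print(report)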

Next Steps