Total time: 24-48 hours on an 8×H100 node (mostly unattended training) • Active setup time: 30-45 minutes • Difficulty: Intermediate
TL;DR: Phase 1 (SFT) takes ~4-6 hours to establish a baseline teacher, Phase 2 (GRPO) runs ~24-36 hours on 8×H100 with periodic monitoring, and a final validation pass checks the trained teacher against the reported 15.7% accuracy improvement before you promote the checkpoint.

Overview

This guide walks through a complete ATLAS training experiment, demonstrating the two-phase pipeline: supervised fine-tuning (SFT) followed by reinforcement learning with Group Relative Policy Optimization (GRPO).

Prerequisites

Before starting, ensure you have:
  • Hardware: 2 GPUs minimum (1 for vLLM, 1 for training) with 40GB+ VRAM each; 4-8× H100 recommended
  • Environment: Python 3.11/3.12 with ATLAS dependencies installed (see Installation)
  • Authentication: HuggingFace token with access to Arc-Intelligence datasets
  • Storage: ~200GB for checkpoints and logs
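
Before launching anything, it can help to confirm GPU visibility and HuggingFace authentication from the same environment you will train in. A minimal check, assuming the huggingface_hub package installed with the ATLAS dependencies:

import torch
from huggingface_hub import whoami

# List the GPUs the launch scripts will see
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, {props.total_memory / 1e9:.0f} GB")

# Confirm the HuggingFace token is active (raises if you are not logged in)
print(f"Authenticated as: {whoami()['name']}")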

Phase 1: SFT Warmup

The SFT phase establishes foundational reasoning capabilities before adaptive teaching training.

Configuration

# configs/run/teacher_sft.yaml
model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
dataset_name: Arc-Intelligence/Arc-ATLAS-Teach-v0
dataset_config: sft
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 2e-5
warmup_ratio: 0.1
output_dir: results/sft_checkpoint
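
The effective global batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs; a quick sanity check for the launch options below:

# Effective global batch size implied by the SFT config above
per_device_train_batch_size = 2
gradient_accumulation_steps = 8

for num_gpus in (2, 4, 8):
    effective = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
    print(f"{num_gpus} GPUs -> effective batch size {effective}")
# 2 GPUs -> 32, 4 GPUs -> 64, 8 GPUs -> 128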

Execution

# Minimum setup (2 GPUs)
scripts/launch.sh 2 configs/run/teacher_sft.yaml \
  output_dir=path/to/save/pre_rl_model

# Recommended setup (4 GPUs)
scripts/launch.sh 4 configs/run/teacher_sft.yaml \
  output_dir=path/to/save/pre_rl_model

# Full production setup (8 GPUs)
scripts/launch.sh 8 configs/run/teacher_sft.yaml \
  output_dir=path/to/save/pre_rl_model

# Memory-constrained with offloading
scripts/launch.sh 2 configs/run/teacher_sft.yaml \
  output_dir=path/to/save/pre_rl_model \
  +offload

Expected Metrics

| Metric | Expected Range | Notes |
|---|---|---|
| Training Loss | 1.2-1.5 | Should decrease monotonically |
| Gradient Norm | <5.0 | Indicates stable training |
| GPU Memory | 70-80 GB | Per device with batch size 2 |
| Duration | 4-6 hours | On the 8×H100 setup |
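
For a quick programmatic check of the loss trend after the run finishes, the HuggingFace Trainer saves a trainer_state.json with the logged history. A minimal sketch, assuming the SFT run wrote its output to results/sft_checkpoint and logged loss under the standard "loss" key:

import json

with open("results/sft_checkpoint/trainer_state.json") as f:
    state = json.load(f)

# Collect (step, loss) pairs from the Trainer's log history
losses = [(e["step"], e["loss"]) for e in state["log_history"] if "loss" in e]
print(f"first: {losses[0]}  last: {losses[-1]}")

# Flag large upward jumps, which would contradict the expected steady decrease
for (s1, l1), (s2, l2) in zip(losses, losses[1:]):
    if l2 > 1.5 * l1:
        print(f"warning: loss rose from {l1:.3f} (step {s1}) to {l2:.3f} (step {s2})")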

Phase 2: GRPO Training

The RL phase trains adaptive teaching capabilities through policy gradient optimization.

Technical Background

GRPO implements the following objective function:
L = -E[r(y|x) * log π(y|x)] + β * KL(π || π_ref)
Where:
  • r(y|x): Reward function based on student performance improvement
  • π: Current policy
  • π_ref: Reference policy (SFT checkpoint)
  • β: KL divergence coefficient (default: 0.04)
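
As a concrete illustration, here is a toy computation of this simplified objective over a batch of sampled responses. This is illustrative only, not the trainer's actual GRPO implementation, which additionally normalizes rewards within each group of num_generations samples:

import torch

def simplified_grpo_loss(rewards, logprobs, ref_logprobs, beta=0.04):
    """Toy version of L = -E[r * log pi] + beta * KL(pi || pi_ref).

    rewards:      (batch,) reward per sampled response
    logprobs:     (batch,) sum of log pi(y|x) over response tokens
    ref_logprobs: (batch,) same quantity under the frozen SFT reference
    """
    policy_term = -(rewards * logprobs).mean()
    # Monte Carlo estimate of KL(pi || pi_ref) from samples drawn from pi
    kl_term = (logprobs - ref_logprobs).mean()
    return policy_term + beta * kl_term

rewards = torch.tensor([0.8, -0.2, 0.5, 0.1])
logprobs = torch.tensor([-12.0, -15.5, -11.2, -13.8])
ref_logprobs = torch.tensor([-12.4, -14.9, -11.0, -14.1])
print(simplified_grpo_loss(rewards, logprobs, ref_logprobs))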

Configuration

# configs/run/teacher_rcl.yaml
model_name_or_path: path/to/sft_checkpoint
dataset_name: Arc-Intelligence/Arc-ATLAS-Teach-v0
dataset_config: rl
num_generations: 32
temperature: 0.7
beta: 0.04
degradation_penalty_multiplier: 2.0
efficiency_weight: 0.3
max_steps: 1000
eval_steps: 100

Launch with vLLM Server

# Minimum setup (2 GPUs total: 1 for vLLM, 1 for training)
# Point model_name_or_path at the Phase 1 output_dir
scripts/launch_with_server.sh 1 1 configs/run/teacher_rcl.yaml \
  model_name_or_path=path/to/save/pre_rl_model

# Recommended setup (4 GPUs: 2 for vLLM, 2 for training)
scripts/launch_with_server.sh 2 2 configs/run/teacher_rcl.yaml \
  model_name_or_path=path/to/save/pre_rl_model

# Full production setup (8 GPUs: 4 for vLLM, 4 for training)
scripts/launch_with_server.sh 4 4 configs/run/teacher_rcl.yaml \
  model_name_or_path=path/to/save/pre_rl_model

# Monitor server health
curl http://localhost:8000/health
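
To gate downstream steps on server readiness, you can poll the same health endpoint programmatically. A small helper, assuming the default port 8000 used above and the requests package:

import time
import requests

def wait_for_vllm(url="http://localhost:8000/health", timeout_s=600, interval_s=10):
    """Poll the vLLM health endpoint until it responds or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                print("vLLM server is up")
                return True
        except requests.RequestException:
            pass  # server not ready yet
        time.sleep(interval_s)
    raise TimeoutError(f"vLLM server did not become healthy within {timeout_s}s")

wait_for_vllm()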

Key Parameters Explained

num_generations
Number of response samples per prompt. Higher values improve gradient estimates but increase compute cost.
  • Default: 32
  • Range: 16-64
  • Trade-off: Quality vs. speed

temperature
Sampling temperature for generation. Controls exploration vs. exploitation.
  • Default: 0.7
  • Range: 0.5-1.0
  • Effect: Higher values increase diversity

beta
KL divergence coefficient. Prevents policy collapse.
  • Default: 0.04
  • Range: 0.01-0.1
  • Warning: Too low causes instability

degradation_penalty_multiplier
Penalty applied when a guided response is worse than the baseline.
  • Default: 2.0
  • Purpose: Ensures the non-degradation guarantee
  • Formula: penalty = -multiplier * performance_drop
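
To make the degradation penalty concrete, the sketch below applies the formula above together with a simple efficiency bonus. It is an illustration of the shaping logic, not the repository's actual reward implementation:

def shaped_reward(baseline_score, guided_score, tokens_used,
                  degradation_penalty_multiplier=2.0,
                  efficiency_weight=0.3, token_budget=2048):
    """Illustrative reward: gain over the unguided baseline, an amplified
    penalty when guidance hurts, and a small bonus for concise guidance."""
    delta = guided_score - baseline_score
    if delta >= 0:
        reward = delta
    else:
        # penalty = -multiplier * performance_drop
        reward = -degradation_penalty_multiplier * (-delta)
    reward += efficiency_weight * max(0.0, 1.0 - tokens_used / token_budget)
    return reward

print(shaped_reward(0.60, 0.75, tokens_used=900))  # student improved
print(shaped_reward(0.60, 0.50, tokens_used=900))  # student degraded -> amplified penalty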

Monitoring Training Progress

Real-time Metrics

# TensorBoard visualization
tensorboard --logdir results/ --port 6006

# Weights & Biases (if configured)
# Dashboard available at wandb.ai/your-project

Critical Metrics to Track

| Metric | Healthy Range | Warning Signs |
|---|---|---|
| Reward Mean | Increasing | Plateau or decrease |
| Non-degradation Rate | >95% | <90% indicates issues |
| KL Divergence | 0.5-2.0 | >5.0 suggests collapse |
| GPU Utilization | >80% | <50% indicates a bottleneck |
| vLLM Throughput | >1000 tok/s | <500 tok/s needs optimization |
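
To check these signals outside the TensorBoard UI, the event files can be read directly. A minimal sketch using the tensorboard package's EventAccumulator; scalar tag names depend on the trainer's logging setup, so the snippet lists them first (adjust the log directory to your run):

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("results/metrics")  # directory containing the event files
acc.Reload()

tags = acc.Tags()["scalars"]
print("Scalar tags:", tags)

# Print the latest value of any reward- or KL-related scalar
for tag in tags:
    if "reward" in tag.lower() or "kl" in tag.lower():
        last = acc.Scalars(tag)[-1]
        print(f"{tag}: {last.value:.4f} at step {last.step}")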

Diagnostic Commands

# Check vLLM server status
curl http://localhost:8000/metrics

# Monitor GPU usage
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1

# Training process logs
tail -f results/train.log

Expected Outcomes

After successful completion:

Training Duration

  • 2× GPUs: 4-5 days
  • 4× GPUs: 2-3 days
  • 8× H100: 24-36 hours

Performance Metrics (from ATLAS Technical Report)

  • Teaching Efficiency: 15.7% average accuracy improvement
  • Non-degradation Rate: 97%
  • Token Efficiency: 50% reduction in response length (4k → 2k tokens)
  • Completion Rate: 31-percentage-point improvement (69% → 100%)

Output Artifacts

results/
├── sft_checkpoint/          # Phase 1 model
│   ├── pytorch_model.bin
│   └── config.json
├── rl_checkpoint/           # Phase 2 model
│   ├── pytorch_model.bin
│   ├── config.json
│   └── trainer_state.json
├── logs/
│   ├── train.log
│   └── vllm_server.log
└── metrics/
    └── tensorboard_events
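
Before promoting the RL checkpoint, a quick existence check over the key artifacts can catch interrupted runs. Paths follow the layout above; weight files may be .bin or .safetensors depending on your transformers version:

from pathlib import Path

expected = [
    "results/sft_checkpoint/config.json",
    "results/rl_checkpoint/config.json",
    "results/rl_checkpoint/trainer_state.json",
    "results/logs/train.log",
]

missing = [p for p in expected if not Path(p).exists()]
print("All key artifacts present" if not missing else f"Missing artifacts: {missing}")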

Validation

Verify the trained checkpoints with a quick smoke test (the full teaching protocol and evaluation run through optimize_teaching.py, as noted in the snippet):
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load trained teacher model
teacher_model = AutoModelForCausalLM.from_pretrained(
    "results/rl_checkpoint",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
teacher_tokenizer = AutoTokenizer.from_pretrained(
    "results/rl_checkpoint"
)

# Load baseline student model
student_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",
    torch_dtype=torch.float16,
    device_map="auto"
)
student_tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507"
)

# Test problem
problem = "Solve: A train travels 120 miles in 2 hours. What is its speed?"

# Get baseline student response
inputs = student_tokenizer(problem, return_tensors="pt").to(student_model.device)
baseline_output = student_model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
baseline_response = student_tokenizer.decode(baseline_output[0], skip_special_tokens=True)

# Get teacher-guided response (using ATLAS protocol)
# In production, use optimize_teaching.py for full protocol
teacher_inputs = teacher_tokenizer(
    f"Guide the student on: {problem}",
    return_tensors="pt"
).to(teacher_model.device)
teacher_output = teacher_model.generate(**teacher_inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
guidance = teacher_tokenizer.decode(teacher_output[0], skip_special_tokens=True)

print(f"Baseline: {baseline_response}")
print(f"Teacher guidance: {guidance}")

Troubleshooting

Out-of-memory errors

# Reduce batch size
per_device_train_batch_size: 1

# Enable gradient checkpointing
gradient_checkpointing: true

# Use CPU offloading (minimum 2 GPUs)
scripts/launch.sh 2 configs/run/teacher_sft.yaml +offload

vLLM server not responding

# Check server logs
cat results/vllm_server.log

# Verify port availability
lsof -i :8000

# Restart with a different port
vllm_port=8080 scripts/launch_with_server.sh 4 4 configs/run/teacher_rcl.yaml

Unstable rewards or policy collapse
  • Increase beta to strengthen the KL constraint
  • Reduce temperature for more conservative sampling
  • Check dataset quality and the reward function implementation
