Total time: 24-48 hours on an 8×H100 node (mostly unattended training) • Active setup time: 30-45 minutes • Difficulty: Intermediate
TL;DR: Phase 1 (SFT) takes ~4-6 hours to establish a baseline teacher, Phase 2 (GRPO) runs ~24-36 hours on 8×H100 with periodic monitoring, and a final validation pass checks the trained teacher against the reported 15.7% accuracy improvement before you promote the checkpoint.

Overview

This guide walks through a complete ATLAS training experiment, demonstrating the two-phase pipeline: supervised fine-tuning (SFT) followed by reinforcement learning with Group Relative Policy Optimization (GRPO).

Prerequisites

Before starting, ensure you have:
  • Hardware: 2 GPUs minimum (1 for vLLM, 1 for training) with 40GB+ VRAM each; 4-8× H100 recommended
  • Environment: Python 3.11/3.12 with ATLAS dependencies installed (see Installation)
  • Authentication: HuggingFace token with access to Arc-Intelligence datasets
  • Storage: ~200GB for checkpoints and logs
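
Before launching anything, it can help to confirm GPU visibility and HuggingFace authentication from the same environment you will train in. A minimal check, assuming the huggingface_hub package installed with the ATLAS dependencies:

import torch
from huggingface_hub import whoami

# List the GPUs the launch scripts will see
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, {props.total_memory / 1e9:.0f} GB")

# Confirm the HuggingFace token is active (raises if you are not logged in)
print(f"Authenticated as: {whoami()['name']}")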

Phase 1: SFT Warmup

The SFT phase establishes foundational reasoning capabilities before adaptive teaching training.

Configuration

# configs/run/teacher_sft.yaml
model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
dataset_name: Arc-Intelligence/Arc-ATLAS-Teach-v0
dataset_config: sft
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 2e-5
warmup_ratio: 0.1
output_dir: results/sft_checkpoint
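
The effective global batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs; a quick sanity check for the launch options below:

# Effective global batch size implied by the SFT config above
per_device_train_batch_size = 2
gradient_accumulation_steps = 8

for num_gpus in (2, 4, 8):
    effective = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
    print(f"{num_gpus} GPUs -> effective batch size {effective}")
# 2 GPUs -> 32, 4 GPUs -> 64, 8 GPUs -> 128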

Execution

# Minimum setup (2 GPUs)
scripts/launch.sh 2 configs/run/teacher_sft.yaml \
  output_dir=path/to/save/pre_rl_model

# Recommended setup (4 GPUs)
scripts/launch.sh 4 configs/run/teacher_sft.yaml \
  output_dir=path/to/save/pre_rl_model

# Full production setup (8 GPUs)
scripts/launch.sh 8 configs/run/teacher_sft.yaml \
  output_dir=path/to/save/pre_rl_model

# Memory-constrained with offloading
scripts/launch.sh 2 configs/run/teacher_sft.yaml \
  output_dir=path/to/save/pre_rl_model \
  +offload

Expected Metrics

| Metric | Expected Range | Notes |
|---|---|---|
| Training Loss | 1.2-1.5 | Should decrease monotonically |
| Gradient Norm | <5.0 | Indicates stable training |
| GPU Memory | 70-80 GB | Per device with batch size 2 |
| Duration | 4-6 hours | On the 8×H100 setup |
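
For a quick programmatic check of the loss trend after the run finishes, the HuggingFace Trainer saves a trainer_state.json with the logged history. A minimal sketch, assuming the SFT run wrote its output to results/sft_checkpoint and logged loss under the standard "loss" key:

import json

with open("results/sft_checkpoint/trainer_state.json") as f:
    state = json.load(f)

# Collect (step, loss) pairs from the Trainer's log history
losses = [(e["step"], e["loss"]) for e in state["log_history"] if "loss" in e]
print(f"first: {losses[0]}  last: {losses[-1]}")

# Flag large upward jumps, which would contradict the expected steady decrease
for (s1, l1), (s2, l2) in zip(losses, losses[1:]):
    if l2 > 1.5 * l1:
        print(f"warning: loss rose from {l1:.3f} (step {s1}) to {l2:.3f} (step {s2})")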

Phase 2: GRPO Training

The RL phase trains adaptive teaching capabilities through policy gradient optimization.

Technical Background

GRPO implements the following objective function:
L = -E[r(y|x) * log π(y|x)] + β * KL(π || π_ref)
Where:
  • r(y|x): Reward function based on student performance improvement
  • π: Current policy
  • π_ref: Reference policy (SFT checkpoint)
  • β: KL divergence coefficient (default: 0.04)
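
As a concrete illustration, here is a toy computation of this simplified objective over a batch of sampled responses. This is illustrative only, not the trainer's actual GRPO implementation, which additionally normalizes rewards within each group of num_generations samples:

import torch

def simplified_grpo_loss(rewards, logprobs, ref_logprobs, beta=0.04):
    """Toy version of L = -E[r * log pi] + beta * KL(pi || pi_ref).

    rewards:      (batch,) reward per sampled response
    logprobs:     (batch,) sum of log pi(y|x) over response tokens
    ref_logprobs: (batch,) same quantity under the frozen SFT reference
    """
    policy_term = -(rewards * logprobs).mean()
    # Monte Carlo estimate of KL(pi || pi_ref) from samples drawn from pi
    kl_term = (logprobs - ref_logprobs).mean()
    return policy_term + beta * kl_term

rewards = torch.tensor([0.8, -0.2, 0.5, 0.1])
logprobs = torch.tensor([-12.0, -15.5, -11.2, -13.8])
ref_logprobs = torch.tensor([-12.4, -14.9, -11.0, -14.1])
print(simplified_grpo_loss(rewards, logprobs, ref_logprobs))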

Configuration

# configs/run/teacher_rcl.yaml
model_name_or_path: path/to/sft_checkpoint
dataset_name: Arc-Intelligence/Arc-ATLAS-Teach-v0
dataset_config: rl
num_generations: 32
temperature: 0.7
beta: 0.04
degradation_penalty_multiplier: 2.0
efficiency_weight: 0.3
max_steps: 1000
eval_steps: 100

Launch with vLLM Server

# Minimum setup (2 GPUs total: 1 for vLLM, 1 for training)
# Point model_name_or_path at the Phase 1 output_dir
scripts/launch_with_server.sh 1 1 configs/run/teacher_rcl.yaml \
  model_name_or_path=path/to/save/pre_rl_model

# Recommended setup (4 GPUs: 2 for vLLM, 2 for training)
scripts/launch_with_server.sh 2 2 configs/run/teacher_rcl.yaml \
  model_name_or_path=path/to/save/pre_rl_model

# Full production setup (8 GPUs: 4 for vLLM, 4 for training)
scripts/launch_with_server.sh 4 4 configs/run/teacher_rcl.yaml \
  model_name_or_path=path/to/save/pre_rl_model

# Monitor server health
curl http://localhost:8000/health
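
To gate downstream steps on server readiness, you can poll the same health endpoint programmatically. A small helper, assuming the default port 8000 used above and the requests package:

import time
import requests

def wait_for_vllm(url="http://localhost:8000/health", timeout_s=600, interval_s=10):
    """Poll the vLLM health endpoint until it responds or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                print("vLLM server is up")
                return True
        except requests.RequestException:
            pass  # server not ready yet
        time.sleep(interval_s)
    raise TimeoutError(f"vLLM server did not become healthy within {timeout_s}s")

wait_for_vllm()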

Key Parameters Explained

num_generations
Number of response samples per prompt. Higher values improve gradient estimates but increase compute cost.
  • Default: 32
  • Range: 16-64
  • Trade-off: Quality vs. speed

temperature
Sampling temperature for generation. Controls exploration vs. exploitation.
  • Default: 0.7
  • Range: 0.5-1.0
  • Effect: Higher values increase diversity

beta
KL divergence coefficient. Prevents policy collapse.
  • Default: 0.04
  • Range: 0.01-0.1
  • Warning: Too low causes instability

degradation_penalty_multiplier
Penalty applied when a guided response is worse than the baseline.
  • Default: 2.0
  • Purpose: Ensures the non-degradation guarantee
  • Formula: penalty = -multiplier * performance_drop
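
To make the degradation penalty concrete, the sketch below applies the formula above together with a simple efficiency bonus. It is an illustration of the shaping logic, not the repository's actual reward implementation:

def shaped_reward(baseline_score, guided_score, tokens_used,
                  degradation_penalty_multiplier=2.0,
                  efficiency_weight=0.3, token_budget=2048):
    """Illustrative reward: gain over the unguided baseline, an amplified
    penalty when guidance hurts, and a small bonus for concise guidance."""
    delta = guided_score - baseline_score
    if delta >= 0:
        reward = delta
    else:
        # penalty = -multiplier * performance_drop
        reward = -degradation_penalty_multiplier * (-delta)
    reward += efficiency_weight * max(0.0, 1.0 - tokens_used / token_budget)
    return reward

print(shaped_reward(0.60, 0.75, tokens_used=900))  # student improved
print(shaped_reward(0.60, 0.50, tokens_used=900))  # student degraded -> amplified penalty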

Monitoring Training Progress

Real-time Metrics

# TensorBoard visualization
tensorboard --logdir results/ --port 6006

# Weights & Biases (if configured)
# Dashboard available at wandb.ai/your-project

Critical Metrics to Track

| Metric | Healthy Range | Warning Signs |
|---|---|---|
| Reward Mean | Increasing | Plateau or decrease |
| Non-degradation Rate | >95% | <90% indicates issues |
| KL Divergence | 0.5-2.0 | >5.0 suggests collapse |
| GPU Utilization | >80% | <50% indicates a bottleneck |
| vLLM Throughput | >1000 tok/s | <500 tok/s needs optimization |
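
To check these signals outside the TensorBoard UI, the event files can be read directly. A minimal sketch using the tensorboard package's EventAccumulator; scalar tag names depend on the trainer's logging setup, so the snippet lists them first (adjust the log directory to your run):

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("results/metrics")  # directory containing the event files
acc.Reload()

tags = acc.Tags()["scalars"]
print("Scalar tags:", tags)

# Print the latest value of any reward- or KL-related scalar
for tag in tags:
    if "reward" in tag.lower() or "kl" in tag.lower():
        last = acc.Scalars(tag)[-1]
        print(f"{tag}: {last.value:.4f} at step {last.step}")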

Diagnostic Commands

# Check vLLM server status
curl http://localhost:8000/metrics

# Monitor GPU usage
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1

# Training process logs
tail -f results/train.log

Expected Outcomes

After successful completion:

Training Duration

  • 2× GPUs: 4-5 days
  • 4× GPUs: 2-3 days
  • 8× H100: 24-36 hours

Performance Metrics (from ATLAS Technical Report)

  • Teaching Efficiency: 15.7% average accuracy improvement
  • Non-degradation Rate: 97%
  • Token Efficiency: 50% reduction in response length (4k → 2k tokens)
  • Completion Rate: 31-percentage-point improvement (69% → 100%)

Output Artifacts

results/
├── sft_checkpoint/          # Phase 1 model
│   ├── pytorch_model.bin
│   └── config.json
├── rl_checkpoint/           # Phase 2 model
│   ├── pytorch_model.bin
│   ├── config.json
│   └── trainer_state.json
├── logs/
│   ├── train.log
│   └── vllm_server.log
└── metrics/
    └── tensorboard_events
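
Before promoting the RL checkpoint, a quick existence check over the key artifacts can catch interrupted runs. Paths follow the layout above; weight files may be .bin or .safetensors depending on your transformers version:

from pathlib import Path

expected = [
    "results/sft_checkpoint/config.json",
    "results/rl_checkpoint/config.json",
    "results/rl_checkpoint/trainer_state.json",
    "results/logs/train.log",
]

missing = [p for p in expected if not Path(p).exists()]
print("All key artifacts present" if not missing else f"Missing artifacts: {missing}")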

Validation

Verify the trained checkpoints with a quick smoke test (the full teaching protocol and evaluation run through optimize_teaching.py, as noted in the snippet):
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load trained teacher model
teacher_model = AutoModelForCausalLM.from_pretrained(
    "results/rl_checkpoint",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
teacher_tokenizer = AutoTokenizer.from_pretrained(
    "results/rl_checkpoint"
)

# Load baseline student model
student_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",
    torch_dtype=torch.float16,
    device_map="auto"
)
student_tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507"
)

# Test problem
problem = "Solve: A train travels 120 miles in 2 hours. What is its speed?"

# Get baseline student response
inputs = student_tokenizer(problem, return_tensors="pt").to(student_model.device)
baseline_output = student_model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
baseline_response = student_tokenizer.decode(baseline_output[0], skip_special_tokens=True)

# Get teacher-guided response (using ATLAS protocol)
# In production, use optimize_teaching.py for full protocol
teacher_inputs = teacher_tokenizer(
    f"Guide the student on: {problem}",
    return_tensors="pt"
).to(teacher_model.device)
teacher_output = teacher_model.generate(**teacher_inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
guidance = teacher_tokenizer.decode(teacher_output[0], skip_special_tokens=True)

print(f"Baseline: {baseline_response}")
print(f"Teacher guidance: {guidance}")

Troubleshooting

Out-of-memory errors

# Reduce batch size
per_device_train_batch_size: 1

# Enable gradient checkpointing
gradient_checkpointing: true

# Use CPU offloading (minimum 2 GPUs)
scripts/launch.sh 2 configs/run/teacher_sft.yaml +offload

vLLM server not responding

# Check server logs
cat results/vllm_server.log

# Verify port availability
lsof -i :8000

# Restart with a different port
vllm_port=8080 scripts/launch_with_server.sh 4 4 configs/run/teacher_rcl.yaml

Unstable rewards or policy collapse
  • Increase beta to strengthen the KL constraint
  • Reduce temperature for more conservative sampling
  • Check dataset quality and the reward function implementation
