Overview

This guide provides exact steps to reproduce the 15.7% accuracy improvement and other benchmark results reported in our technical documentation.
Reproduction requires 4×H100 GPUs for full-scale training. For smaller-scale validation, see the Quick Validation section.

Environment Setup

Hardware Requirements

Full Reproduction

  • 4×H100 80GB GPUs
  • NVLink interconnect
  • 128GB system RAM
  • 500GB NVMe storage

Quick Validation

  • 1×A100 40GB GPU
  • 32GB system RAM
  • 100GB storage
  • ~4 hours runtime

Software Stack

# Python environment
python --version  # 3.11 or 3.12 required
pip install -r requirements.txt

# Verify CUDA
nvidia-smi
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# Authenticate with Hugging Face
huggingface-cli login
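
Before launching, it can save hours to confirm that the visible hardware matches the requirements above. A minimal sketch using PyTorch's CUDA introspection (this helper script is not part of the repo; adjust the thresholds for the Quick Validation setup):

# check_gpus.py -- hypothetical helper, not part of the repo scripts
import torch

def check_gpus(required_count=4, min_mem_gb=75):
    # Allow a small margin below 80GB for driver/ECC reservations
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available")
    count = torch.cuda.device_count()
    for i in range(count):
        props = torch.cuda.get_device_properties(i)
        mem_gb = props.total_memory / 1024**3
        print(f"GPU {i}: {props.name}, {mem_gb:.0f} GB")
        if mem_gb < min_mem_gb:
            print(f"  warning: below the {min_mem_gb} GB needed for full reproduction")
    if count < required_count:
        print(f"warning: found {count} GPUs, full reproduction expects {required_count}")

if __name__ == "__main__":
    check_gpus()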

Configuration Files

Key configuration files for reproduction:
# configs/run/teacher_sft.yaml
model_name_or_path: Qwen/Qwen3-8B-Instruct-2507
dataset_id_or_path: Arc-Intelligence/Arc-ATLAS-Teach-v0
output_dir: results/pre_rl_model
seed: 42
num_train_epochs: 1

# configs/run/teacher_rcl.yaml
model_name_or_path: results/pre_rl_model
dataset_id_or_path: Arc-Intelligence/Arc-ATLAS-Teach-v0
num_generations: 32
seed: 42
beta: 0.04
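
The launch scripts accept key=value overrides for these fields, as the commands in the steps below show. To sanity-check what a config actually resolves to before committing GPU hours, you can load it directly; a minimal sketch using PyYAML:

# Print the reproducibility-critical values from a run config
# (hypothetical helper, not part of the repo scripts)
import yaml

with open("configs/run/teacher_sft.yaml") as f:
    cfg = yaml.safe_load(f)

for key in ("model_name_or_path", "dataset_id_or_path", "seed"):
    print(f"{key}: {cfg.get(key)}")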

Full Reproduction Steps

Phase 1: SFT Warmup

Train the initial supervised fine-tuned model:
scripts/launch.sh 4 configs/run/teacher_sft.yaml \
  dataset_id_or_path=Arc-Intelligence/Arc-ATLAS-Teach-v0 \
  output_dir=results/pre_rl_model \
  seed=42
Expected duration: 4-8 hours on 4×H100
Checkpoint size: ~16GB
Key metric: loss < 0.5
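
To confirm the loss criterion without opening TensorBoard, you can inspect the checkpoint's training log. The sketch below assumes the run uses the Hugging Face Trainer's output layout (a trainer_state.json whose log_history lists one entry per logging step); if your launcher logs differently, adapt accordingly:

# Check the final SFT training loss (assumes HF Trainer output layout)
import json

with open("results/pre_rl_model/trainer_state.json") as f:
    state = json.load(f)

losses = [e["loss"] for e in state["log_history"] if "loss" in e]
print(f"final loss: {losses[-1]:.3f} (target < 0.5)")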

Phase 2: GRPO Training

Run reinforcement learning with vLLM server:
scripts/launch_with_server.sh 1 3 configs/run/teacher_rcl.yaml \
  model_name_or_path=results/pre_rl_model \
  dataset_id_or_path=Arc-Intelligence/Arc-ATLAS-Teach-v0 \
  num_generations=32 \
  seed=42 \
  beta=0.04
Expected duration: 24-48 hours on 4×H100
Key metrics (a programmatic check follows this list):
  • Reward > 0.5
  • KL divergence < 10
  • Non-degradation rate > 95%
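
Reward and KL are logged to TensorBoard during training. A minimal sketch for extracting them programmatically with tensorboard's EventAccumulator; the scalar tag names depend on the trainer, so this matches substrings rather than hard-coding tags:

# Pull reward and KL curves from the TensorBoard event files
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

ea = EventAccumulator("results/")  # adjust to the actual run directory
ea.Reload()

for tag in ea.Tags()["scalars"]:
    if "reward" in tag.lower() or "kl" in tag.lower():
        values = [e.value for e in ea.Scalars(tag)]
        print(f"{tag}: last={values[-1]:.3f}, max={max(values):.3f}")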

Phase 3: Evaluation

Validate final performance:
python scripts/evaluate_model.py \
  --model_path results/final_model \
  --dataset Arc-Intelligence/Arc-ATLAS-Teach-v0 \
  --num_samples 32 \
  --seed 42
Expected results:
  • Accuracy improvement: +15.7% ± 1.2%
  • Completion rate: ~100%
  • Token reduction: ~50%

Quick Validation

For rapid testing without full training:
# Download pre-trained checkpoint
huggingface-cli download Arc-Intelligence/ATLAS-8B-Thinking \
  --local-dir checkpoints/teacher

# Run minimal training (4 steps)
scripts/launch_with_server.sh 1 1 configs/run/teacher_rcl.yaml \
  model_name_or_path=checkpoints/teacher \
  max_steps=4 \
  eval_steps=1 \
  report_to=null

# Verify performance
python scripts/quick_eval.py --model checkpoints/teacher
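
For a smoke test independent of the repo scripts, the downloaded checkpoint can be loaded directly with transformers (assuming it is a standard causal LM checkpoint; device_map="auto" requires accelerate to be installed):

# Minimal generation smoke test for the downloaded checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("checkpoints/teacher")
model = AutoModelForCausalLM.from_pretrained("checkpoints/teacher", device_map="auto")

inputs = tokenizer("A student asks: what is 12 * 17?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))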

Expected Metrics

After successful reproduction, you should observe:
Metric                 | Expected Value | Tolerance
Average accuracy gain  | +15.7%         | ±1.2%
Max improvement        | +29.6%         | ±2.1%
Completion rate        | ~100%          | ±2%
Token reduction        | 50%            | ±5%
Generation speedup     | 13.6%          | ±2%
Non-degradation rate   | 97%            | ±1%

Monitoring Training

Real-time Metrics

# TensorBoard monitoring
tensorboard --logdir results/ --port 6006

# vLLM server health
watch -n 5 'curl -s http://localhost:8765/metrics'

# GPU utilization
nvidia-smi dmon -s u -d 5
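
The server health check can also be done programmatically; a minimal sketch using requests, assuming the vLLM server exposes Prometheus-style metrics on port 8765 as in the curl example above:

# Poll the vLLM metrics endpoint and report whether the server is reachable
import time
import requests

for _ in range(3):  # poll a few times; loop indefinitely in a real watchdog
    try:
        resp = requests.get("http://localhost:8765/metrics", timeout=5)
        resp.raise_for_status()
        # Prometheus text format: one "name value" pair per non-comment line
        lines = [l for l in resp.text.splitlines() if not l.startswith("#")]
        print(f"server up, {len(lines)} metrics reported")
    except requests.RequestException as exc:
        print(f"vLLM server unreachable: {exc}")
    time.sleep(5)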

Key Indicators

  • GPU utilization > 90% (a programmatic check follows this list)
  • Reward trending upward
  • KL divergence stable (5-15)
  • Loss decreasing smoothly
  • No NaN/Inf values
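
To check the GPU-utilization indicator from a script instead of eyeballing dmon output, a minimal sketch shelling out to nvidia-smi (these are standard nvidia-smi query flags):

# Flag any GPU running below the 90% utilization target
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

for i, line in enumerate(out.strip().splitlines()):
    util = int(line)
    flag = "" if util > 90 else "  <-- below target"
    print(f"GPU {i}: {util}%{flag}")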

Troubleshooting

Out-of-Memory Errors

# Add gradient checkpointing and reduce the per-device batch size
scripts/launch_with_server.sh 1 3 configs/run/teacher_rcl.yaml \
  gradient_checkpointing=true \
  per_device_train_batch_size=1 \
  gradient_accumulation_steps=32

vLLM Port Conflicts

# Check if the default port is in use
lsof -i :8765

# Use an alternative port
scripts/launch_with_server.sh 1 3 configs/run/teacher_rcl.yaml \
  vllm_port=8766

Slow Training

# Enable optimizations
export TORCH_COMPILE=1
export FLASH_ATTENTION=1

scripts/launch_with_server.sh 1 3 configs/run/teacher_rcl.yaml \
  tf32=true \
  dataloader_num_workers=4

Hugging Face Authentication Errors

# Re-login to Hugging Face
huggingface-cli logout
huggingface-cli login

# Verify access
huggingface-cli download Arc-Intelligence/ATLAS-8B-Thinking README.md

Validation Scripts

Statistical Significance Test

# scripts/validate_significance.py
import numpy as np
from scipy import stats

def validate_improvement(baseline_file, enhanced_file):
    baseline = np.load(baseline_file)['accuracy']
    enhanced = np.load(enhanced_file)['accuracy']

    # Paired t-test
    t_stat, p_value = stats.ttest_rel(enhanced, baseline)

    print(f"Improvement: {np.mean(enhanced - baseline):.3f}")
    print(f"P-value: {p_value:.6f}")
    print(f"Significant: {p_value < 0.001}")

if __name__ == "__main__":
    validate_improvement("baseline.npz", "enhanced.npz")

Performance Verification

# scripts/verify_metrics.py
import json
from pathlib import Path

def load_metrics(results_dir):
    # Assumes metrics were written to eval_results.json, as listed
    # under Required Outputs below
    with open(Path(results_dir) / "eval_results.json") as f:
        return json.load(f)

def verify_benchmarks(results_dir):
    metrics = load_metrics(results_dir)

    expected = {
        'accuracy_gain': (0.157, 0.012),  # mean, tolerance
        'completion_rate': (1.0, 0.02),
        'token_reduction': (0.5, 0.05),
        'speed_gain': (0.136, 0.02),
    }

    for metric, (expected_val, tolerance) in expected.items():
        actual = metrics[metric]
        within_tolerance = abs(actual - expected_val) <= tolerance
        status = '✓' if within_tolerance else '✗'
        print(f"{metric}: {actual:.3f} "
              f"(expected {expected_val:.3f} ±{tolerance:.3f}) {status}")

if __name__ == "__main__":
    verify_benchmarks("results")

Artifact Management

Required Outputs

Save these artifacts for verification:
results/
├── pre_rl_model/         # SFT checkpoint
├── final_model/          # GRPO checkpoint
├── eval_results.json     # Evaluation metrics
├── training_logs/        # TensorBoard logs
├── config_used.yaml      # Exact configuration
└── environment.txt       # pip freeze output
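
If your run does not already write environment.txt, it can be captured with pip freeze; a minimal sketch:

# Record the exact package versions used for the run
import subprocess
from pathlib import Path

freeze = subprocess.run(
    ["pip", "freeze"], capture_output=True, text=True, check=True
).stdout
Path("results/environment.txt").write_text(freeze)
print(f"wrote {len(freeze.splitlines())} pinned packages to results/environment.txt")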

Sharing Results

# Package for sharing
tar -czf atlas_reproduction.tar.gz \
  results/eval_results.json \
  results/config_used.yaml \
  results/environment.txt

# Upload to Hugging Face
huggingface-cli upload your-org/atlas-reproduction \
  atlas_reproduction.tar.gz

Next Steps