Overview

This guide provides exact steps to reproduce the 15.7% accuracy improvement and other benchmark results reported in our technical documentation.
Reproduction requires 4×H100 GPUs for full-scale training. For smaller-scale validation, see the Quick Validation section.

Environment Setup

Hardware Requirements

Full Reproduction

  • 4×H100 80GB GPUs
  • NVLink interconnect
  • 128GB system RAM
  • 500GB NVMe storage

Quick Validation

  • 1×A100 40GB GPU
  • 32GB system RAM
  • 100GB storage
  • ~4 hours runtime

Software Stack

# Python environment
python --version  # 3.11 or 3.12 required
pip install -r requirements.txt

# Verify CUDA
nvidia-smi
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# Authenticate with Hugging Face
huggingface-cli login
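
Before launching, it can save hours to confirm that the visible hardware matches the requirements above. A minimal sketch using PyTorch's CUDA introspection (this helper script is not part of the repo; adjust the thresholds for the Quick Validation setup):

# check_gpus.py -- hypothetical helper, not part of the repo scripts
import torch

def check_gpus(required_count=4, min_mem_gb=75):
    # Allow a small margin below 80GB for driver/ECC reservations
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available")
    count = torch.cuda.device_count()
    for i in range(count):
        props = torch.cuda.get_device_properties(i)
        mem_gb = props.total_memory / 1024**3
        print(f"GPU {i}: {props.name}, {mem_gb:.0f} GB")
        if mem_gb < min_mem_gb:
            print(f"  warning: below the {min_mem_gb} GB needed for full reproduction")
    if count < required_count:
        print(f"warning: found {count} GPUs, full reproduction expects {required_count}")

if __name__ == "__main__":
    check_gpus()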

Configuration Files

Key configuration files for reproduction:
# configs/run/teacher_sft.yaml
model_name_or_path: Qwen/Qwen3-8B-Instruct-2507
dataset_id_or_path: Arc-Intelligence/Arc-ATLAS-Teach-v0
output_dir: results/pre_rl_model
seed: 42
num_train_epochs: 1

# configs/run/teacher_rcl.yaml
model_name_or_path: results/pre_rl_model
dataset_id_or_path: Arc-Intelligence/Arc-ATLAS-Teach-v0
num_generations: 32
seed: 42
beta: 0.04
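
The launch scripts accept key=value overrides for these fields, as the commands in the steps below show. To sanity-check what a config actually resolves to before committing GPU hours, you can load it directly; a minimal sketch using PyYAML:

# Print the reproducibility-critical values from a run config
# (hypothetical helper, not part of the repo scripts)
import yaml

with open("configs/run/teacher_sft.yaml") as f:
    cfg = yaml.safe_load(f)

for key in ("model_name_or_path", "dataset_id_or_path", "seed"):
    print(f"{key}: {cfg.get(key)}")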

Full Reproduction Steps

Phase 1: SFT Warmup

Train the initial supervised fine-tuned model:
scripts/launch.sh 4 configs/run/teacher_sft.yaml \
  dataset_id_or_path=Arc-Intelligence/Arc-ATLAS-Teach-v0 \
  output_dir=results/pre_rl_model \
  seed=42
Expected duration: 4-8 hours on 4×H100
Checkpoint size: ~16GB
Key metric: loss < 0.5
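
To confirm the loss criterion without opening TensorBoard, you can inspect the checkpoint's training log. The sketch below assumes the run uses the Hugging Face Trainer's output layout (a trainer_state.json whose log_history lists one entry per logging step); if your launcher logs differently, adapt accordingly:

# Check the final SFT training loss (assumes HF Trainer output layout)
import json

with open("results/pre_rl_model/trainer_state.json") as f:
    state = json.load(f)

losses = [e["loss"] for e in state["log_history"] if "loss" in e]
print(f"final loss: {losses[-1]:.3f} (target < 0.5)")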

Phase 2: GRPO Training

Run reinforcement learning with vLLM server:
scripts/launch_with_server.sh 1 3 configs/run/teacher_rcl.yaml \
  model_name_or_path=results/pre_rl_model \
  dataset_id_or_path=Arc-Intelligence/Arc-ATLAS-Teach-v0 \
  num_generations=32 \
  seed=42 \
  beta=0.04
Expected duration: 24-48 hours on 4×H100
Key metrics (a programmatic check follows this list):
  • Reward > 0.5
  • KL divergence < 10
  • Non-degradation rate > 95%
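
Reward and KL are logged to TensorBoard during training. A minimal sketch for extracting them programmatically with tensorboard's EventAccumulator; the scalar tag names depend on the trainer, so this matches substrings rather than hard-coding tags:

# Pull reward and KL curves from the TensorBoard event files
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

ea = EventAccumulator("results/")  # adjust to the actual run directory
ea.Reload()

for tag in ea.Tags()["scalars"]:
    if "reward" in tag.lower() or "kl" in tag.lower():
        values = [e.value for e in ea.Scalars(tag)]
        print(f"{tag}: last={values[-1]:.3f}, max={max(values):.3f}")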

Phase 3: Evaluation

Validate final performance:
python scripts/evaluate_model.py \
  --model_path results/final_model \
  --dataset Arc-Intelligence/Arc-ATLAS-Teach-v0 \
  --num_samples 32 \
  --seed 42
Expected results:
  • Accuracy improvement: +15.7% ± 1.2%
  • Completion rate: ~100%
  • Token reduction: ~50%

Quick Validation

For rapid testing without full training:
# Download pre-trained checkpoint
huggingface-cli download Arc-Intelligence/ATLAS-8B-Thinking \
  --local-dir checkpoints/teacher

# Run minimal training (4 steps)
scripts/launch_with_server.sh 1 1 configs/run/teacher_rcl.yaml \
  model_name_or_path=checkpoints/teacher \
  max_steps=4 \
  eval_steps=1 \
  report_to=null

# Verify performance
python scripts/quick_eval.py --model checkpoints/teacher
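
For a smoke test independent of the repo scripts, the downloaded checkpoint can be loaded directly with transformers (assuming it is a standard causal LM checkpoint; device_map="auto" requires accelerate to be installed):

# Minimal generation smoke test for the downloaded checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("checkpoints/teacher")
model = AutoModelForCausalLM.from_pretrained("checkpoints/teacher", device_map="auto")

inputs = tokenizer("A student asks: what is 12 * 17?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))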

Expected Metrics

After successful reproduction, you should observe:
Metric                 | Expected Value | Tolerance
Average accuracy gain  | +15.7%         | ±1.2%
Max improvement        | +29.6%         | ±2.1%
Completion rate        | ~100%          | ±2%
Token reduction        | 50%            | ±5%
Generation speedup     | 13.6%          | ±2%
Non-degradation rate   | 97%            | ±1%

Monitoring Training

Real-time Metrics

# TensorBoard monitoring
tensorboard --logdir results/ --port 6006

# vLLM server health
watch -n 5 'curl -s http://localhost:8765/metrics'

# GPU utilization
nvidia-smi dmon -s u -d 5
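
The server health check can also be done programmatically; a minimal sketch using requests, assuming the vLLM server exposes Prometheus-style metrics on port 8765 as in the curl example above:

# Poll the vLLM metrics endpoint and report whether the server is reachable
import time
import requests

for _ in range(3):  # poll a few times; loop indefinitely in a real watchdog
    try:
        resp = requests.get("http://localhost:8765/metrics", timeout=5)
        resp.raise_for_status()
        # Prometheus text format: one "name value" pair per non-comment line
        lines = [l for l in resp.text.splitlines() if not l.startswith("#")]
        print(f"server up, {len(lines)} metrics reported")
    except requests.RequestException as exc:
        print(f"vLLM server unreachable: {exc}")
    time.sleep(5)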

Key Indicators

  • GPU utilization > 90% (a programmatic check follows this list)
  • Reward trending upward
  • KL divergence stable (5-15)
  • Loss decreasing smoothly
  • No NaN/Inf values
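
To check the GPU-utilization indicator from a script instead of eyeballing dmon output, a minimal sketch shelling out to nvidia-smi (these are standard nvidia-smi query flags):

# Flag any GPU running below the 90% utilization target
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

for i, line in enumerate(out.strip().splitlines()):
    util = int(line)
    flag = "" if util > 90 else "  <-- below target"
    print(f"GPU {i}: {util}%{flag}")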

Troubleshooting

Out-of-Memory Errors

# Add gradient checkpointing and reduce the per-device batch size
scripts/launch_with_server.sh 1 3 configs/run/teacher_rcl.yaml \
  gradient_checkpointing=true \
  per_device_train_batch_size=1 \
  gradient_accumulation_steps=32

vLLM Port Conflicts

# Check if the default port is in use
lsof -i :8765

# Use an alternative port
scripts/launch_with_server.sh 1 3 configs/run/teacher_rcl.yaml \
  vllm_port=8766

Slow Training

# Enable optimizations
export TORCH_COMPILE=1
export FLASH_ATTENTION=1

scripts/launch_with_server.sh 1 3 configs/run/teacher_rcl.yaml \
  tf32=true \
  dataloader_num_workers=4

Hugging Face Authentication Errors

# Re-login to Hugging Face
huggingface-cli logout
huggingface-cli login

# Verify access
huggingface-cli download Arc-Intelligence/ATLAS-8B-Thinking README.md

Validation Scripts

Statistical Significance Test

# scripts/validate_significance.py
import numpy as np
from scipy import stats

def validate_improvement(baseline_file, enhanced_file):
    baseline = np.load(baseline_file)['accuracy']
    enhanced = np.load(enhanced_file)['accuracy']

    # Paired t-test
    t_stat, p_value = stats.ttest_rel(enhanced, baseline)

    print(f"Improvement: {np.mean(enhanced - baseline):.3f}")
    print(f"P-value: {p_value:.6f}")
    print(f"Significant: {p_value < 0.001}")

if __name__ == "__main__":
    validate_improvement("baseline.npz", "enhanced.npz")

Performance Verification

# scripts/verify_metrics.py
import json
from pathlib import Path

def load_metrics(results_dir):
    # Assumes metrics were written to eval_results.json, as listed
    # under Required Outputs below
    with open(Path(results_dir) / "eval_results.json") as f:
        return json.load(f)

def verify_benchmarks(results_dir):
    metrics = load_metrics(results_dir)

    expected = {
        'accuracy_gain': (0.157, 0.012),  # mean, tolerance
        'completion_rate': (1.0, 0.02),
        'token_reduction': (0.5, 0.05),
        'speed_gain': (0.136, 0.02),
    }

    for metric, (expected_val, tolerance) in expected.items():
        actual = metrics[metric]
        within_tolerance = abs(actual - expected_val) <= tolerance
        status = '✓' if within_tolerance else '✗'
        print(f"{metric}: {actual:.3f} "
              f"(expected {expected_val:.3f} ±{tolerance:.3f}) {status}")

if __name__ == "__main__":
    verify_benchmarks("results")

Artifact Management

Required Outputs

Save these artifacts for verification:
results/
├── pre_rl_model/         # SFT checkpoint
├── final_model/          # GRPO checkpoint
├── eval_results.json     # Evaluation metrics
├── training_logs/        # TensorBoard logs
├── config_used.yaml      # Exact configuration
└── environment.txt       # pip freeze output
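
If your run does not already write environment.txt, it can be captured with pip freeze; a minimal sketch:

# Record the exact package versions used for the run
import subprocess
from pathlib import Path

freeze = subprocess.run(
    ["pip", "freeze"], capture_output=True, text=True, check=True
).stdout
Path("results/environment.txt").write_text(freeze)
print(f"wrote {len(freeze.splitlines())} pinned packages to results/environment.txt")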

Sharing Results

# Package for sharing
tar -czf atlas_reproduction.tar.gz \
  results/eval_results.json \
  results/config_used.yaml \
  results/environment.txt

# Upload to Hugging Face
huggingface-cli upload your-org/atlas-reproduction \
  atlas_reproduction.tar.gz

Next Steps