## Overview

This guide provides exact steps to reproduce the closed-loop +15.7% accuracy improvement and related metrics reported in our technical documentation. Once you reproduce the baseline, export the traces and run our offline GRPO pipeline to train a bespoke teacher checkpoint for your domain.

Full-scale reproduction requires 4×H100 GPUs. For smaller-scale validation, see the Quick Validation section below.
## Set Up Environment

### Hardware Requirements

**Full Reproduction**

- 4×H100 80GB GPUs
- NVLink interconnect
- 128GB system RAM
- 500GB NVMe storage

**Quick Validation**

- 1×A100 40GB GPU
- 32GB system RAM
- 100GB storage
- ~4 hours runtime
### Software Stack

```bash
# Python environment
python --version  # 3.11 or 3.12 required
pip install -r requirements.txt

# Verify CUDA
nvidia-smi
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# Authenticate with Hugging Face
huggingface-cli login
```
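If you prefer a scripted check, the sketch below bundles the same verifications into one file. It is illustrative, not a script shipped with the repo:

```python
# check_env.py -- hypothetical helper; mirrors the shell checks above.
import sys

import torch

assert sys.version_info[:2] in ((3, 11), (3, 12)), "Python 3.11 or 3.12 required"
assert torch.cuda.is_available(), "CUDA is not available"

gpu_count = torch.cuda.device_count()
print(f"PyTorch {torch.__version__}, {gpu_count} visible GPU(s)")
if gpu_count < 4:
    print("Fewer than 4 GPUs detected: use the Quick Validation path.")
```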
### Configuration Files

Key configuration files for reproduction:

```yaml
# src/atlas_core/configs/recipe/teacher_sft.yaml
model_name_or_path: Qwen/Qwen3-8B-Instruct-2507
dataset_id_or_path: Arc-Intelligence/Arc-ATLAS-Teach-v0
output_dir: results/pre_rl_model
seed: 42
num_train_epochs: 1
```

```yaml
# src/atlas_core/configs/recipe/teacher_rcl.yaml
model_name_or_path: results/pre_rl_model
dataset_id_or_path: Arc-Intelligence/Arc-ATLAS-Teach-v0
num_generations: 32
seed: 42
beta: 0.04
```
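Command-line overrides in the phases below take precedence over these file values. To confirm what a run will use, you can inspect the YAML directly; a minimal sketch with PyYAML, assuming the files parse as plain YAML:

```python
import yaml  # pip install pyyaml

with open("src/atlas_core/configs/recipe/teacher_rcl.yaml") as f:
    cfg = yaml.safe_load(f)

# Print the values that matter most for reproducing the GRPO run.
for key in ("model_name_or_path", "num_generations", "seed", "beta"):
    print(f"{key}: {cfg.get(key)}")
```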
## Full Reproduction Steps

### Phase 1: SFT Warmup

Train the initial supervised fine-tuned model:

```bash
scripts/launch.sh 4 src/atlas_core/configs/recipe/teacher_sft.yaml \
  dataset_id_or_path=Arc-Intelligence/Arc-ATLAS-Teach-v0 \
  output_dir=results/pre_rl_model \
  seed=42
```

- **Expected duration:** 4-8 hours on 4×H100
- **Checkpoint size:** ~16GB
- **Key metric:** loss < 0.5 (see the verification sketch below)
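To verify the loss criterion without opening TensorBoard, you can read the trainer state from the checkpoint directory. This assumes the launcher writes a Hugging-Face-Trainer-style trainer_state.json; adjust if your output layout differs:

```python
import json

# Assumption: the SFT run leaves an HF-Trainer-style trainer_state.json behind.
with open("results/pre_rl_model/trainer_state.json") as f:
    state = json.load(f)

losses = [entry["loss"] for entry in state["log_history"] if "loss" in entry]
print(f"Final training loss: {losses[-1]:.4f}")
assert losses[-1] < 0.5, "SFT warmup did not reach the expected loss threshold"
```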
### Phase 2: GRPO Training

Run reinforcement learning with the vLLM server:

```bash
scripts/launch_with_server.sh 1 3 src/atlas_core/configs/recipe/teacher_rcl.yaml \
  model_name_or_path=results/pre_rl_model \
  dataset_id_or_path=Arc-Intelligence/Arc-ATLAS-Teach-v0 \
  num_generations=32 \
  seed=42 \
  beta=0.04
```

- **Expected duration:** 24-48 hours on 4×H100
- **Key metrics:**
  - Reward > 0.5
  - KL divergence < 10
  - Non-degradation rate > 95%
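For intuition on how these settings interact: GRPO samples num_generations=32 completions per prompt and scores each one against its group's mean reward, while beta=0.04 weights a KL penalty that keeps the policy close to the SFT checkpoint. The sketch below illustrates the group-relative advantage computation; it is a simplified rendering, not the repo's training code:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """Standardize rewards within each prompt's group of sampled completions.

    rewards: array of shape (num_prompts, num_generations), e.g. 32 per group.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + 1e-8)

# Toy example: one prompt, four completions with rewards 0, 0.5, 0.5, 1.
print(group_relative_advantages(np.array([[0.0, 0.5, 0.5, 1.0]])))
```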
### Phase 3: Evaluation

Validate final performance with the lightweight Transformers snippet below (no additional repo files required):

```bash
python - << 'PY'
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "results/final_model"
DATASET = "Arc-Intelligence/Arc-ATLAS-Teach-v0"
DATASET_CONFIG = "rl"
SAMPLES = 32

dataset = load_dataset(DATASET, DATASET_CONFIG, split="validation").shuffle(seed=42).select(range(SAMPLES))
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")

correct = 0
for example in dataset:
    inputs = tokenizer(example["prompt"], return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    prediction = tokenizer.decode(output[0], skip_special_tokens=True)
    # Coarse grading: count a sample as correct if the reference answer
    # appears anywhere in the decoded output.
    if example["ground_truth"].strip().lower() in prediction.lower():
        correct += 1

accuracy = correct / SAMPLES
print(f"Accuracy over {SAMPLES} samples: {accuracy:.2%}")
PY
```
Expected results (closed-loop runtime + GRPO):

- Accuracy improvement: +15.7% ± 1.2%
- Completion rate: +31% ± 2%
- Non-degradation: ≥97%
- Token savings: ~50%

To continue beyond the baseline, export the traces with the SDK and launch `atlas-core offline-pipeline --export-path traces/runtime.jsonl` to begin GRPO training.
## Quick Validation

For rapid testing without full training:

```bash
# Download pre-trained checkpoint
huggingface-cli download Arc-Intelligence/ATLAS-8B-Thinking \
  --local-dir checkpoints/teacher

# Run minimal training (4 steps)
scripts/launch_with_server.sh 1 1 src/atlas_core/configs/recipe/teacher_rcl.yaml \
  model_name_or_path=checkpoints/teacher \
  max_steps=4 \
  eval_steps=1 \
  report_to=null

# Verify performance (reuse the evaluation snippet above with fewer samples)
python - << 'PY'
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "checkpoints/teacher"
DATASET = "Arc-Intelligence/Arc-ATLAS-Teach-v0"
DATASET_CONFIG = "rl"
SAMPLES = 16

dataset = load_dataset(DATASET, DATASET_CONFIG, split="validation").shuffle(seed=7).select(range(SAMPLES))
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")

correct = 0
for example in dataset:
    inputs = tokenizer(example["prompt"], return_tensors="pt").to(model.device)
    prediction = model.generate(**inputs, max_new_tokens=256)
    decoded = tokenizer.decode(prediction[0], skip_special_tokens=True)
    if example["ground_truth"].strip().lower() in decoded.lower():
        correct += 1

print(f"Quick validation accuracy ({SAMPLES} samples): {correct / SAMPLES:.2%}")
PY
```
## Expected Metrics

After successful reproduction, you should observe:

| Metric | Expected Value | Tolerance |
| --- | --- | --- |
| Average accuracy gain (closed loop) | +15.7% | ±1.2% |
| Max improvement (closed loop) | +29.6% | ±2.1% |
| Completion rate | ~100% | ±2% |
| Token reduction | 50% | ±5% |
| Generation speedup | 13.6% | ±2% |
| Non-degradation rate | 97% | ±1% |
| Offline GRPO gain | Sustained lift from training on exported traces | Compute-bound |
## Monitoring Training

### Real-time Metrics

```bash
# TensorBoard monitoring
tensorboard --logdir results/ --port 6006

# vLLM server health
watch -n 5 'curl -s http://localhost:8765/metrics'

# GPU utilization
nvidia-smi dmon -s u -d 5
```
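If you want alerting rather than a terminal watch, the same health check is easy to script; a minimal sketch using requests against the port configured above:

```python
import time

import requests

# Poll the vLLM metrics endpoint every 5 seconds, mirroring the watch command.
while True:
    try:
        response = requests.get("http://localhost:8765/metrics", timeout=5)
        print(f"vLLM server OK (HTTP {response.status_code})")
    except requests.RequestException as exc:
        print(f"vLLM server unreachable: {exc}")
    time.sleep(5)
```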
### Key Indicators

**Healthy Training**

- GPU utilization > 90%
- Reward trending upward
- KL divergence stable (5-15)
- Loss decreasing smoothly
- No NaN/Inf values

**Issues to Watch**

- GPU utilization < 70% → check data loading
- Reward plateauing → adjust learning rate
- KL divergence > 20 → increase beta
- Loss spikes → check for bad samples
- OOM errors → reduce batch size
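The reward and KL thresholds can also be checked programmatically from the TensorBoard event files. The sketch below uses the tensorboard package's event reader; the scalar tag names are assumptions, so list the available tags first and substitute your own:

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

events = EventAccumulator("results/training_logs")  # path from Artifact Management below
events.Reload()
print("Available tags:", events.Tags()["scalars"])

# Hypothetical tag names -- replace with the tags printed above.
for tag, low, high in (("train/reward", 0.5, None), ("train/kl", None, 20.0)):
    if tag not in events.Tags()["scalars"]:
        continue
    last = events.Scalars(tag)[-1].value
    ok = (low is None or last >= low) and (high is None or last <= high)
    print(f"{tag}: {last:.3f} [{'OK' if ok else 'CHECK'}]")
```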
## Troubleshooting

### Out-of-Memory Errors

```bash
# Add gradient checkpointing and shrink the per-device batch
scripts/launch_with_server.sh 1 3 src/atlas_core/configs/recipe/teacher_rcl.yaml \
  gradient_checkpointing=true \
  per_device_train_batch_size=1 \
  gradient_accumulation_steps=32
```
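These overrides trade compute for memory; the effective batch size is the product of the per-device batch, the accumulation steps, and the number of trainer GPUs. A quick sanity check, assuming the `1 3` launch arguments mean one vLLM GPU plus three trainer GPUs:

```python
# Hypothetical arithmetic check for the override above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 32
num_trainer_gpus = 3  # assumption: launch_with_server.sh 1 3 = 1 vLLM + 3 trainer GPUs
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_trainer_gpus
print(f"Effective batch size: {effective_batch}")  # 96
```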
### vLLM Server Connection Failed

```bash
# Check if port is in use
lsof -i :8765

# Use alternative port
scripts/launch_with_server.sh 1 3 src/atlas_core/configs/recipe/teacher_rcl.yaml \
  vllm_port=8766
```
### Slow Training Throughput

```bash
# Enable optimizations
export TORCH_COMPILE=1
export FLASH_ATTENTION=1
scripts/launch_with_server.sh 1 3 src/atlas_core/configs/recipe/teacher_rcl.yaml \
  tf32=true \
  dataloader_num_workers=4
```

### Hugging Face Authentication Errors

```bash
# Re-login to Hugging Face
huggingface-cli logout
huggingface-cli login

# Verify access
huggingface-cli download Arc-Intelligence/ATLAS-8B-Thinking README.md
```
## Validation Snippets

### Statistical Significance Test

Drop this snippet into any Python session (or save it as tools/validate_significance.py) to compare baseline and enhanced runs:

```python
import numpy as np
from scipy import stats

def validate_improvement(baseline_file, enhanced_file):
    # Each .npz file should hold a per-example 'accuracy' array.
    baseline = np.load(baseline_file)['accuracy']
    enhanced = np.load(enhanced_file)['accuracy']
    # Paired t-test over the same evaluation examples.
    t_stat, p_value = stats.ttest_rel(enhanced, baseline)
    print(f"Improvement: {np.mean(enhanced - baseline):.3f}")
    print(f"P-value: {p_value:.6f}")
    print(f"Significant: {p_value < 0.001}")
```
Use this helper to confirm the reproduced metrics stay within tolerance bands before sharing results:

```python
def verify_benchmarks(metrics):
    expected = {
        'accuracy_gain': (0.157, 0.012),  # (mean, tolerance)
        'completion_rate': (1.0, 0.02),
        'token_reduction': (0.5, 0.05),
        'speed_gain': (0.136, 0.02),
    }
    for metric, (expected_val, tolerance) in expected.items():
        actual = metrics[metric]
        within_tolerance = abs(actual - expected_val) <= tolerance
        status = "PASS" if within_tolerance else "FAIL"
        print(f"{metric}: {actual:.3f} (expected {expected_val:.3f} ± {tolerance:.3f}) [{status}]")
```
## Artifact Management

### Required Outputs

Save these artifacts for verification:

```text
results/
├── pre_rl_model/        # SFT checkpoint
├── final_model/         # GRPO checkpoint
├── eval_results.json    # Evaluation metrics
├── training_logs/       # TensorBoard logs
├── config_used.yaml     # Exact configuration
└── environment.txt      # pip freeze output
```
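Two of these files (eval_results.json and environment.txt) can be generated at the end of a run; a minimal sketch, with placeholder metrics standing in for your real evaluation output:

```python
import json
import subprocess

# Freeze the environment for later verification.
freeze = subprocess.run(["pip", "freeze"], capture_output=True, text=True).stdout
with open("results/environment.txt", "w") as f:
    f.write(freeze)

# Persist evaluation metrics; replace the placeholder values with real results.
with open("results/eval_results.json", "w") as f:
    json.dump({"accuracy": 0.0, "samples": 32}, f, indent=2)
```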
### Sharing Results

```bash
# Package for sharing
tar -czf atlas_reproduction.tar.gz \
  results/eval_results.json \
  results/config_used.yaml \
  results/environment.txt

# Upload to Hugging Face
huggingface-cli upload your-org/atlas-reproduction \
  atlas_reproduction.tar.gz
```
## Next Steps

- **Methodology**: Understand the evaluation protocol.
- **Deploy Model**: Use your trained model.