Core Principles

ATLAS evaluation verifies that adaptive teaching improves student outcomes without degrading performance for capable students. The framework measures both quantitative metrics and qualitative teaching effectiveness.

  • Non-Degradation: Ensure teaching never harms performance (≥97% safety rate)
  • Efficiency: Measure token reduction and speed improvements
  • Generalization: Validate across diverse tasks and model scales

Evaluation Protocol

Two-Pass Comparison Framework

1. Baseline Measurement

Run the student model on the evaluation tasks without teacher assistance:
baseline_response = student_model.generate(task)
baseline_accuracy = evaluate(baseline_response)
2. Teacher-Assisted Evaluation

Apply the ATLAS two-pass protocol:
# Pass 1: Diagnostic probe (≤50 tokens)
capability = teacher.diagnose(student_response)

# Pass 2: Adaptive teaching (≤200 tokens)
guidance = teacher.generate_guidance(capability, task)
enhanced_response = student.generate(task, guidance)
3. Performance Comparison

Calculate improvement metrics:
improvement = enhanced_accuracy - baseline_accuracy
non_degradation = (improvement >= 0)
efficiency_gain = (baseline_tokens - enhanced_tokens) / baseline_tokens
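
Put together, the three steps form a single comparison loop. The sketch below is illustrative only: it assumes student_model, teacher, and evaluate expose the calls used above, and the aggregate field names are placeholders.

# Sketch of the full two-pass comparison over an evaluation set.
# Assumes student_model, teacher, and evaluate() behave as in the steps above.
def run_two_pass_evaluation(tasks, student_model, teacher, evaluate):
    records = []
    for task in tasks:
        # Baseline: student alone
        baseline_response = student_model.generate(task)
        baseline_accuracy = evaluate(baseline_response)

        # Pass 1: diagnostic probe, Pass 2: adaptive teaching
        capability = teacher.diagnose(baseline_response)
        guidance = teacher.generate_guidance(capability, task)
        enhanced_accuracy = evaluate(student_model.generate(task, guidance))

        records.append({
            "improvement": enhanced_accuracy - baseline_accuracy,
            "non_degraded": enhanced_accuracy >= baseline_accuracy,
        })

    n = len(records)
    return {
        "mean_improvement": sum(r["improvement"] for r in records) / n,
        "non_degradation_rate": sum(r["non_degraded"] for r in records) / n,
    }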

Non-Degradation Verification

Critical safety metric ensuring teaching never makes performance worse:
| Metric | Definition | Target | Achieved |
|---|---|---|---|
| NDR (Non-Degradation Rate) | % of interactions with improvement ≥ 0 | ≥99% | 97% |
| Degradation Severity | Average loss when degradation occurs | <5% | 3.2% |
| Recovery Rate | % of degraded cases recovered in retry | >80% | 82% |
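
Given paired per-task scores, the first two rows of the table can be computed directly. The sketch below is a minimal example; retry bookkeeping for recovery rate is omitted, and the function name is illustrative.

import numpy as np

def non_degradation_metrics(baseline_scores, enhanced_scores):
    """Compute NDR and average degradation severity from paired per-task scores."""
    baseline = np.asarray(baseline_scores, dtype=float)
    enhanced = np.asarray(enhanced_scores, dtype=float)
    improvement = enhanced - baseline

    ndr = float((improvement >= 0).mean())  # Non-Degradation Rate
    degraded = improvement[improvement < 0]
    # Average loss on the cases where teaching hurt performance
    severity = float(-degraded.mean()) if degraded.size else 0.0

    return {"ndr": ndr, "degradation_severity": severity}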

Efficiency Metrics

Comprehensive measurement of resource utilization:
# Teaching Efficiency Score (TES)
TES = (accuracy_gain * completion_rate) / (teaching_tokens / 1000)

# Learning Rate (LR)
LR = Δ_performance / num_interactions

# Token Efficiency
efficiency = 1 - (enhanced_tokens / baseline_tokens)
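
As a hedged sketch, these formulas translate one-to-one into small helpers; the argument names simply mirror the symbols above and are not a fixed API.

def teaching_efficiency_score(accuracy_gain, completion_rate, teaching_tokens):
    # TES: accuracy gain weighted by completion, per 1k teaching tokens
    return (accuracy_gain * completion_rate) / (teaching_tokens / 1000)

def learning_rate(delta_performance, num_interactions):
    # LR: average performance change per teacher-student interaction
    return delta_performance / num_interactions

def token_efficiency(baseline_tokens, enhanced_tokens):
    # Fraction of baseline tokens saved by the teacher-assisted pass
    return 1 - (enhanced_tokens / baseline_tokens)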

Evaluation Commands

Full Benchmark Suite

Complete evaluation with detailed logging:
# Run comprehensive evaluation
scripts/launch_with_server.sh 1 3 configs/run/teacher_rcl.yaml \
  model_name_or_path=results/pre_rl_model \
  dataset_id_or_path=Arc-Intelligence/Arc-ATLAS-Teach-v0 \
  eval_steps=50 \
  log_completions=true \
  save_completions_probability=0.1 \
  num_generations=32

Quick Validation

Rapid testing for development iterations:
# Minimal evaluation (4 steps)
scripts/launch_with_server.sh 1 1 configs/run/teacher_rcl.yaml \
  report_to=null \
  max_steps=4 \
  eval_steps=1

Production Evaluation

Full-scale testing with statistical validation:
# Multi-seed evaluation for significance testing
for seed in 42 1337 2024; do
  scripts/launch_with_server.sh 4 4 configs/run/teacher_rcl.yaml \
    seed=$seed \
    output_dir=results/eval_seed_$seed \
    dataset_id_or_path=Arc-Intelligence/Arc-ATLAS-Teach-v0
done
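
Once the seeded runs finish, their metrics can be pooled for significance testing. The sketch below assumes each run writes a metrics.json file with an accuracy key under its output_dir; adjust the path and key names to match your actual logging output.

import json
import statistics
from pathlib import Path

# Hypothetical layout: results/eval_seed_<seed>/metrics.json with an "accuracy" key.
accuracies = []
for seed in (42, 1337, 2024):
    metrics_file = Path(f"results/eval_seed_{seed}") / "metrics.json"
    with metrics_file.open() as f:
        accuracies.append(json.load(f)["accuracy"])

print(f"accuracy: {statistics.mean(accuracies):.3f} "
      f"± {statistics.stdev(accuracies):.3f} (n={len(accuracies)})")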

Data Collection Framework

Quantitative Metrics

  • Accuracy improvements vs baseline
  • Task completion rates
  • Per-category performance breakdown
  • Statistical significance (p-values)

Qualitative Analysis

Systematic review of teaching quality:
  1. Diagnostic Accuracy: How well does the probe identify capability gaps?
  2. Teaching Relevance: Is guidance targeted to identified weaknesses?
  3. Adaptation Quality: Does teaching adjust to student skill level?
  4. Failure Patterns: What causes degradation or teaching failures?

Statistical Validation

Significance Testing

All results require statistical validation:
import numpy as np
from scipy import stats

def validate_improvement(baseline_scores, enhanced_scores):
    baseline = np.asarray(baseline_scores, dtype=float)
    enhanced = np.asarray(enhanced_scores, dtype=float)

    # Paired t-test for matched (per-task) samples
    t_stat, p_value = stats.ttest_rel(enhanced, baseline)

    # Cohen's d for effect size
    diff = np.mean(enhanced - baseline)
    pooled_std = np.sqrt((np.var(baseline) + np.var(enhanced)) / 2)
    cohens_d = diff / pooled_std

    return {
        'significant': p_value < 0.001,
        'p_value': p_value,
        'effect_size': cohens_d,
        'improvement': diff
    }
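
For example, with toy paired per-task scores (illustrative numbers, not project results):

baseline = np.array([0.52, 0.48, 0.61, 0.55, 0.47, 0.66, 0.58, 0.50])
enhanced = np.array([0.70, 0.63, 0.72, 0.69, 0.61, 0.74, 0.71, 0.66])

result = validate_improvement(baseline, enhanced)
print(result["improvement"], result["p_value"], result["significant"])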

Sample Size Requirements

| Confidence Level | Effect Size | Required Samples |
|---|---|---|
| 95% | Large (0.8) | 26 per condition |
| 95% | Medium (0.5) | 64 per condition |
| 99% | Large (0.8) | 42 per condition |
| 99% | Medium (0.5) | 106 per condition |
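
Sample sizes of this kind come from a standard power analysis. The sketch below uses statsmodels and assumes an independent two-sample t-test at 80% power; the exact counts will differ somewhat from the table depending on the test and power assumptions used.

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alpha, label in [(0.05, "95%"), (0.01, "99%")]:
    for effect_size in (0.8, 0.5):
        # Solve for the per-group sample size at the given effect size and alpha
        n = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=0.8)
        print(f"{label} confidence, d={effect_size}: {int(round(n))} samples per condition")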

Expected Outcomes

Successful evaluation demonstrates:

  • Accuracy: +15-30% improvement across tasks
  • Completion: ~100% vs ~69% baseline
  • Efficiency: 50% token reduction
  • Speed: 13-15% faster generation

Error Analysis Framework

Failure Mode Categorization

| Category | Frequency | Mitigation |
|---|---|---|
| Parsing errors | 2.1% | Improved normalization |
| Over-teaching | 0.9% | Adaptive threshold tuning |
| Capability mismatch | 0.5% | Enhanced diagnostic probes |
| Template failures | 0.3% | Expanded template coverage |
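
One way to produce such a breakdown is to tag each evaluation record during review and tally the tags. The sketch below is illustrative: the failure_category field and tag names are assumptions, not a fixed schema.

from collections import Counter

def failure_mode_breakdown(records):
    """Return the relative frequency of each tagged failure mode."""
    tags = [r["failure_category"] for r in records if r.get("failure_category")]
    counts = Counter(tags)
    total = len(records)
    return {category: count / total for category, count in counts.items()}

# Example with illustrative records:
records = [
    {"task_id": 1, "failure_category": None},
    {"task_id": 2, "failure_category": "parsing_error"},
    {"task_id": 3, "failure_category": "over_teaching"},
    {"task_id": 4, "failure_category": None},
]
print(failure_mode_breakdown(records))  # {'parsing_error': 0.25, 'over_teaching': 0.25}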

Diagnostic Accuracy

Measure probe effectiveness:
import numpy as np

def evaluate_diagnostic_accuracy(probe_results, actual_performance):
    # categorize_* and create_confusion_matrix are project-specific helpers that
    # map scores onto the capability categories: weak, medium, strong.
    predicted_level = np.asarray(categorize_capability(probe_results))
    actual_level = np.asarray(categorize_performance(actual_performance))

    # Fraction of tasks where the probe's predicted level matches the observed level
    accuracy = (predicted_level == actual_level).mean()
    confusion_matrix = create_confusion_matrix(predicted_level, actual_level)

    return accuracy, confusion_matrix

Scalability Testing

Model Size Scaling

| Student Model | Teacher Model | Improvement | Efficiency |
|---|---|---|---|
| 4B params | 8B params | +18.2% | 0.42 TES |
| 7B params | 8B params | +15.7% | 0.38 TES |
| 13B params | 8B params | +12.3% | 0.35 TES |
| 70B params | 8B params | +8.9% | 0.31 TES |

Infrastructure Scaling

| Configuration | Throughput | Latency (p50) | Latency (p99) |
|---|---|---|---|
| 1×T4 GPU | 2 req/min | 30s | 45s |
| 4×A100 | 16 req/min | 3.75s | 5.2s |
| 8×H100 | 64 req/min | 0.94s | 1.3s |

Reproducibility Requirements

Record the hardware and software environment alongside every run:

# Required in reproduction logs
hardware:
  gpus: 4×H100
  memory: 128GB
  interconnect: NVLink
software:
  python: 3.11.4
  pytorch: 2.1.0
  transformers: 4.36.0
  vllm: 0.2.7

Capture every configuration override used for the evaluation:

# Save all overrides
echo "Configuration:" > eval_config.txt
echo "model_name_or_path=$MODEL" >> eval_config.txt
echo "dataset_id_or_path=$DATASET" >> eval_config.txt
echo "seed=$SEED" >> eval_config.txt

Archive the following artifacts with each evaluation:
  • Training logs (wandb or tensorboard)
  • Metric summaries (JSON format)
  • Representative examples (10% sampling)
  • Configuration files (complete YAML)
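
The software portion of this record can be captured automatically. A minimal sketch, assuming the output file name and package list shown here (adjust both to your setup):

import json
import platform
from importlib.metadata import version, PackageNotFoundError

def snapshot_environment(packages=("torch", "transformers", "vllm"),
                         path="eval_environment.json"):
    """Write the Python and key package versions alongside the evaluation outputs."""
    env = {"python": platform.python_version()}
    for package in packages:
        try:
            env[package] = version(package)
        except PackageNotFoundError:
            env[package] = "not installed"
    with open(path, "w") as f:
        json.dump(env, f, indent=2)
    return env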

Next Steps