## Core Principles
ATLAS evaluation verifies that adaptive teaching improves student outcomes without degrading performance for capable students. The framework measures both quantitative metrics and qualitative teaching effectiveness.

- **Non-Degradation**: Ensure teaching never harms performance (≥97% safety rate)
- **Efficiency**: Measure token reduction and speed improvements
- **Generalization**: Validate across diverse tasks and model scales
## Evaluation Protocol

### Two-Pass Comparison Framework
1. **Baseline Measurement**: Run the student model independently on evaluation tasks (Pass 1 in the sketch below).
2. **Teacher-Assisted Evaluation**: Apply the ATLAS two-pass protocol (Pass 2 below).
3. **Performance Comparison**: Calculate improvement metrics from the paired scores.
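A minimal sketch of this loop, assuming `student` and `teacher` wrappers that expose `solve`, `probe`, and `teach` methods (illustrative names, not a published ATLAS API):

```python
# Two-pass comparison: baseline pass, then teacher-assisted pass on the
# same tasks, recording the per-task score delta.

def evaluate_pair(tasks, student, teacher):
    results = []
    for task in tasks:
        # Pass 1: student works alone (baseline).
        baseline = student.solve(task)

        # Pass 2: teacher probes the student, then supplies adaptive
        # guidance before the student re-attempts the task.
        diagnosis = teacher.probe(student, task)
        guidance = teacher.teach(diagnosis, task)
        assisted = student.solve(task, guidance=guidance)

        results.append({
            "task_id": task.id,
            "baseline_score": baseline.score,
            "assisted_score": assisted.score,
            "improvement": assisted.score - baseline.score,
        })
    return results
```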
### Non-Degradation Verification

Critical safety metric ensuring teaching never makes performance worse (a computation sketch follows the table):

| Metric | Definition | Target | Achieved |
|---|---|---|---|
| NDR (Non-Degradation Rate) | % of interactions with improvement ≥ 0 | ≥99% | 97% |
| Degradation Severity | Average loss when degradation occurs | <5% | 3.2% |
| Recovery Rate | % of degraded cases recovered on retry | >80% | 82% |
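A sketch of how these three metrics can be computed over the per-task results from the two-pass loop above (field names are illustrative):

```python
def non_degradation_metrics(results, retry_results=None):
    degraded = [r for r in results if r["improvement"] < 0]

    # NDR: fraction of interactions whose improvement is >= 0.
    ndr = 1.0 - len(degraded) / len(results)

    # Severity: mean absolute loss, computed only over degraded cases.
    severity = (
        sum(-r["improvement"] for r in degraded) / len(degraded)
        if degraded else 0.0
    )

    metrics = {"ndr": ndr, "degradation_severity": severity}

    # Recovery rate: share of degraded tasks that recover on retry.
    if retry_results is not None and degraded:
        retry_by_id = {r["task_id"]: r for r in retry_results}
        recovered = [
            r for r in degraded
            if retry_by_id.get(r["task_id"], {}).get("improvement", -1) >= 0
        ]
        metrics["recovery_rate"] = len(recovered) / len(degraded)

    return metrics
```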
### Efficiency Metrics

Comprehensive measurement of resource utilization:
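An illustrative set of efficiency aggregates; the exact definitions used by the framework (including the token-efficiency score reported elsewhere on this page) may differ:

```python
def efficiency_metrics(baseline_runs, assisted_runs):
    # Each run record is assumed to carry output token counts and
    # wall-clock time; both field names are illustrative.
    base_tokens = sum(r["output_tokens"] for r in baseline_runs)
    asst_tokens = sum(r["output_tokens"] for r in assisted_runs)
    base_time = sum(r["wall_time_s"] for r in baseline_runs)
    asst_time = sum(r["wall_time_s"] for r in assisted_runs)
    return {
        "token_reduction": 1.0 - asst_tokens / base_tokens,
        "speedup": base_time / asst_time,
        "tokens_per_second": asst_tokens / asst_time,
    }
```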
## Evaluation Commands

### Full Benchmark Suite
Complete evaluation with detailed logging:
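For example, expressed as a Python `subprocess` call (the script path and flags are placeholders for the repository's actual evaluation entry point):

```python
import subprocess

subprocess.run(
    [
        "python", "scripts/evaluate.py",   # assumed entry point
        "--suite", "full",
        "--log-level", "debug",
        "--output-dir", "results/full_benchmark",
    ],
    check=True,
)
```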
### Quick Validation

Rapid testing for development iterations:
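A sketch under the same assumptions: a small task subset and a fixed seed keep iteration fast and deterministic (flag names are hypothetical):

```python
import subprocess

subprocess.run(
    ["python", "scripts/evaluate.py",
     "--suite", "smoke", "--limit", "50", "--seed", "0"],
    check=True,
)
```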
### Production Evaluation

Full-scale testing with statistical validation:
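A sketch along the same lines, adding multiple seeds and significance testing (hypothetical flags; see Statistical Validation below for the underlying tests):

```python
import subprocess

subprocess.run(
    [
        "python", "scripts/evaluate.py",
        "--suite", "full",
        "--num-seeds", "3",
        "--bootstrap-ci",
        "--significance-test", "paired-t",
    ],
    check=True,
)
```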
## Data Collection Framework

### Quantitative Metrics

Collected for every evaluation run (a per-task record sketch follows the list):
- Accuracy improvements vs baseline
- Task completion rates
- Per-category performance breakdown
- Statistical significance (p-values)
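One way to structure the raw data behind these aggregates is a per-task record; field names below are assumptions chosen to match the bullets above:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    category: str        # enables per-category performance breakdown
    baseline_correct: bool
    assisted_correct: bool
    completed: bool      # counts toward the task completion rate

# Accuracy deltas, completion rates, and p-values are then computed
# over a list of TaskResult records.
```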
### Qualitative Analysis

Systematic review of teaching quality:

- **Diagnostic Accuracy**: How well does the probe identify capability gaps?
- **Teaching Relevance**: Is guidance targeted to identified weaknesses?
- **Adaptation Quality**: Does teaching adjust to student skill level?
- **Failure Patterns**: What causes degradation or teaching failures?
## Statistical Validation

### Significance Testing

All results require statistical validation:
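Because the same tasks are evaluated with and without teaching, a paired test is appropriate. A minimal sketch using `scipy` (input lists are the per-task scores from the two passes):

```python
from scipy import stats

def significance(baseline_scores, assisted_scores, alpha=0.05):
    # Paired t-test over matched per-task scores.
    t_stat, p_value = stats.ttest_rel(assisted_scores, baseline_scores)
    return {"t": t_stat, "p": p_value, "significant": p_value < alpha}
```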
### Sample Size Requirements

Required per-condition sample sizes for a two-sample comparison (a derivation sketch follows the table):

| Confidence Level | Effect Size (Cohen's d) | Required Samples |
|---|---|---|
| 95% | Large (0.8) | 26 per condition |
| 95% | Medium (0.5) | 64 per condition |
| 99% | Large (0.8) | 42 per condition |
| 99% | Medium (0.5) | 106 per condition |
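Such requirements can be derived with a standard power analysis. The sketch below assumes a two-sample t-test at 80% power (a conventional default that reproduces the 95%-confidence rows; the power assumption behind the 99% rows is not stated here):

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alpha, d in [(0.05, 0.8), (0.05, 0.5), (0.01, 0.8), (0.01, 0.5)]:
    n = analysis.solve_power(effect_size=d, alpha=alpha, power=0.80)
    print(f"alpha={alpha}, d={d}: {math.ceil(n)} per condition")
```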
## Expected Outcomes

Successful evaluation demonstrates:

- **Accuracy**: +15-30% improvement across tasks
- **Completion**: ~100% vs ~69% baseline
- **Efficiency**: 50% token reduction
- **Speed**: 13-15% faster generation
## Error Analysis Framework

### Failure Mode Categorization

| Category | Frequency | Mitigation |
|---|---|---|
| Parsing errors | 2.1% | Improved normalization |
| Over-teaching | 0.9% | Adaptive threshold tuning |
| Capability mismatch | 0.5% | Enhanced diagnostic probes |
| Template failures | 0.3% | Expanded template coverage |
### Diagnostic Accuracy

Measure probe effectiveness:
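One way to quantify this is agreement between the capability gaps the probe predicts and the errors the student actually makes unaided. A sketch (data shapes are assumptions):

```python
def diagnostic_accuracy(probe_predictions, observed_failures):
    """Both arguments map task_id -> set of capability-gap tags."""
    scores = []
    for task_id, predicted in probe_predictions.items():
        actual = observed_failures.get(task_id, set())
        if not predicted and not actual:
            scores.append(1.0)  # correctly predicted "no gap"
            continue
        # Jaccard agreement between predicted and observed gaps.
        scores.append(len(predicted & actual) / len(predicted | actual))
    return sum(scores) / len(scores)
```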
## Scalability Testing

### Model Size Scaling

| Student Model | Teacher Model | Improvement | Efficiency (TES) |
|---|---|---|---|
| 4B params | 8B params | +18.2% | 0.42 |
| 7B params | 8B params | +15.7% | 0.38 |
| 13B params | 8B params | +12.3% | 0.35 |
| 70B params | 8B params | +8.9% | 0.31 |
### Infrastructure Scaling

| Configuration | Throughput | Latency (p50) | Latency (p99) |
|---|---|---|---|
| 1×T4 GPU | 2 req/min | 30s | 45s |
| 4×A100 | 16 req/min | 3.75s | 5.2s |
| 8×H100 | 64 req/min | 0.94s | 1.3s |
## Reproducibility Requirements
### Environment Specification
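Record the exact software environment alongside each run. A minimal sketch (the output path is illustrative):

```python
import json
import platform
import subprocess
import sys
from pathlib import Path

Path("results").mkdir(exist_ok=True)
env = {
    "python": sys.version,
    "platform": platform.platform(),
    # Full package list for exact reconstruction.
    "packages": subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines(),
}
with open("results/environment.json", "w") as f:
    json.dump(env, f, indent=2)
```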
### Configuration Documentation
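Persist the fully resolved configuration with every run so any result can be traced back to its hyperparameters. A sketch using PyYAML (keys are illustrative):

```python
from pathlib import Path
import yaml  # PyYAML

config = {
    "student_model": "<model path or hub id>",
    "teacher_model": "<model path or hub id>",
    "eval_suite": "full",
    "seed": 0,
}
Path("results").mkdir(exist_ok=True)
with open("results/config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```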
### Artifact Preservation
- Training logs (`wandb` or `tensorboard`)
- Metric summaries (JSON format)
- Representative examples (10% sampling)
- Configuration files (complete YAML)