## Core Principles
ATLAS evaluation verifies that the adaptive dual-agent loop (student + verifying teacher) improves outcomes without degrading performance for capable students. The framework measures both quantitative metrics and qualitative guidance effectiveness.

- Non-Degradation: Ensure teaching never harms performance (≥97% safety rate)
- Efficiency: Measure token reduction and speed improvements
- Generalization: Validate across diverse tasks and model scales
## Evaluation Protocol

### Two-Pass Comparison Framework
#### 1. Baseline Measurement

Run the student model independently on the evaluation tasks:
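A minimal sketch of this pass, assuming tasks are dicts with `id` and `prompt` fields; `student_generate` and `is_correct` are hypothetical stand-ins for your model wrapper and grader, not a published ATLAS API:

```python
def run_baseline(tasks, student_generate, is_correct):
    """Record correctness and token usage for the unassisted student."""
    records = []
    for task in tasks:
        answer, tokens_used = student_generate(task["prompt"])
        records.append({
            "task_id": task["id"],
            "correct": float(is_correct(task, answer)),
            "tokens": tokens_used,
        })
    return records
```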
#### 2. Dual-Agent Evaluation

Apply the ATLAS two-pass protocol:
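A sketch of one plausible reading of the protocol: the teacher first probes, then emits guidance scaled to the diagnosed gap, with an empty string meaning "capable student, do not intervene". All callables are hypothetical:

```python
def run_dual_agent(tasks, student_generate, teacher_probe,
                   teacher_teach, is_correct):
    """Two-pass loop: diagnostic probe, then adaptive teaching."""
    records = []
    for task in tasks:
        diagnosis = teacher_probe(task["prompt"])            # pass 1: probe
        guidance = teacher_teach(task["prompt"], diagnosis)  # pass 2: teach
        # A capable student gets no guidance, so its prompt is unchanged.
        prompt = guidance + "\n" + task["prompt"] if guidance else task["prompt"]
        answer, tokens_used = student_generate(prompt)
        records.append({
            "task_id": task["id"],
            "correct": float(is_correct(task, answer)),
            "tokens": tokens_used,
        })
    return records
```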
#### 3. Performance Comparison

Calculate improvement metrics:
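The comparison itself needs no special tooling. Given the two record lists above, aligned by task, a sketch:

```python
def improvement_metrics(baseline, dual):
    """Aggregate comparison of the two passes (records aligned by task)."""
    deltas = [d["correct"] - b["correct"] for b, d in zip(baseline, dual)]
    return {
        "mean_improvement": sum(deltas) / len(deltas),
        "non_degradation_rate": sum(d >= 0 for d in deltas) / len(deltas),
        "token_reduction": 1 - sum(d["tokens"] for d in dual)
                             / sum(b["tokens"] for b in baseline),
    }
```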
## Non-Degradation Verification

Critical safety metrics ensuring teaching never makes performance worse (a computation sketch follows the table):

| Metric | Definition | Target | Achieved |
|---|---|---|---|
| NDR (Non-Degradation Rate) | % of interactions with improvement ≥ 0 | ≥99% | 97% |
| Degradation Severity | Average loss when degradation occurs | <5% | 3.2% |
| Recovery Rate | % of degraded cases recovered in retry | >80% | 82% |
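A sketch of how these three metrics can be computed; `deltas` are per-task score changes (dual-agent minus baseline) and `retry_deltas` the changes after one retry of the initially degraded tasks. Both inputs are assumptions about the data layout, not a published API:

```python
def safety_metrics(deltas, retry_deltas):
    degraded = [d for d in deltas if d < 0]
    return {
        "ndr": sum(d >= 0 for d in deltas) / len(deltas),
        "degradation_severity": (sum(-d for d in degraded) / len(degraded)
                                 if degraded else 0.0),
        "recovery_rate": (sum(r >= 0 for r in retry_deltas) / len(retry_deltas)
                          if retry_deltas else 1.0),
    }
```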
## Efficiency Metrics

Comprehensive measurement of resource utilization:
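A sketch of the core efficiency summaries under the same assumed record format as the passes above:

```python
def efficiency_metrics(baseline, dual):
    base_tokens = sum(b["tokens"] for b in baseline)
    dual_tokens = sum(d["tokens"] for d in dual)
    solved = sum(d["correct"] for d in dual) or 1.0
    return {
        "token_reduction": 1 - dual_tokens / base_tokens,
        "tokens_per_solve": dual_tokens / solved,  # cost of each correct answer
    }
```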
## Evaluation Commands

### Full Benchmark Suite

Complete evaluation with detailed logging:
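The exact invocation was not preserved here; as an illustration only, a hypothetical Python entry point (`atlas_eval.run_suite`, the config path, and all argument names are assumptions):

```python
from atlas_eval import run_suite  # hypothetical entry point

run_suite(
    config="configs/eval/full_benchmark.yaml",  # assumed path
    log_level="debug",   # detailed per-task logging
    save_traces=True,    # keep transcripts for later GRPO export
)
```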
### Quick Validation

Rapid testing for development iterations:
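The same hypothetical entry point, cut down for fast iteration:

```python
run_suite(
    config="configs/eval/full_benchmark.yaml",
    max_tasks=50,   # small subsample keeps the loop to minutes
    seeds=[0],      # a single seed is enough for a smoke test
)
```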
### Production Evaluation

Full-scale testing with statistical validation:
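And the production variant, again with illustrative argument names:

```python
run_suite(
    config="configs/eval/full_benchmark.yaml",
    seeds=[0, 1, 2, 3, 4],     # multiple seeds for variance estimates
    significance="bootstrap",  # p-values and confidence intervals
    report="reports/production_eval.json",
)
```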
## Data Collection Framework

### Quantitative Metrics
Collected across three dimensions: performance, efficiency, and robustness. Performance metrics include:

- Accuracy improvements vs. baseline
- Task completion rates
- Per-category performance breakdown
- Statistical significance (p-values)
### Qualitative Analysis

Systematic review of teaching quality:

- Diagnostic Accuracy: How well does the probe identify capability gaps?
- Teaching Relevance: Is guidance targeted to identified weaknesses?
- Adaptation Quality: Does teaching adjust to the student's skill level?
- Failure Patterns: What causes degradation or teaching failures?
## Statistical Validation

### Significance Testing

All results require statistical validation:
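For the paired per-task scores produced above, a standard paired t-test plus effect size is one sound choice (real numpy/scipy APIs):

```python
import numpy as np
from scipy import stats

def significance(baseline_scores, dual_scores, alpha=0.05):
    b = np.asarray(baseline_scores, dtype=float)
    d = np.asarray(dual_scores, dtype=float)
    t, p = stats.ttest_rel(d, b)               # paired t-test
    diff = d - b
    cohens_d = diff.mean() / diff.std(ddof=1)  # paired-samples effect size
    return {"t": float(t), "p": float(p),
            "effect_size": float(cohens_d), "significant": p < alpha}
```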
### Sample Size Requirements

| Confidence Level | Effect Size | Required Samples |
|---|---|---|
| 95% | Large (0.8) | 26 per condition |
| 95% | Medium (0.5) | 64 per condition |
| 99% | Large (0.8) | 42 per condition |
| 99% | Medium (0.5) | 106 per condition |
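The 95%/large-effect row, for example, can be reproduced with statsmodels' power solver (two-sample t-test at power 0.8):

```python
from statsmodels.stats.power import TTestIndPower

n = TTestIndPower().solve_power(effect_size=0.8, alpha=0.05, power=0.8)
print(round(n))  # 26 per condition
```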
## Expected Outcomes

Successful evaluation demonstrates:

- Closed-Loop Accuracy: +15–30% lift with the dual-agent runtime (student + verifying teacher)
- Offline GRPO: sustained improvements from training on exported runtime traces
- Completion: ~100% task completion with teaching vs. ~69% baseline
- Efficiency: ~50% token reduction with teaching
## Error Analysis Framework

### Failure Mode Categorization
| Category | Frequency | Mitigation |
|---|---|---|
| Parsing errors | 2.1% | Improved normalization |
| Over-teaching | 0.9% | Adaptive threshold tuning |
| Capability mismatch | 0.5% | Enhanced diagnostic probes |
| Template failures | 0.3% | Expanded template coverage |
### Diagnostic Accuracy

Measure probe effectiveness:
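One way to quantify this: treat the probe's "will fail without help" predictions as a binary classifier against the student's actual baseline failures, and report precision and recall. Both input lists are assumed, aligned by task:

```python
def diagnostic_accuracy(predicted_gap, actually_failed):
    tp = sum(p and a for p, a in zip(predicted_gap, actually_failed))
    fp = sum(p and not a for p, a in zip(predicted_gap, actually_failed))
    fn = sum(a and not p for p, a in zip(predicted_gap, actually_failed))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}
```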
## Scalability Testing

### Model Size Scaling
| Student Model | Teacher Model | Improvement | Efficiency (TES) |
|---|---|---|---|
| 4B params | 8B params | +18.2% | 0.42 |
| 7B params | 8B params | +15.7% | 0.38 |
| 13B params | 8B params | +12.3% | 0.35 |
| 70B params | 8B params | +8.9% | 0.31 |
### Infrastructure Scaling
| Configuration | Throughput | Latency (p50) | Latency (p99) |
|---|---|---|---|
| 1×T4 GPU | 2 req/min | 30s | 45s |
| 4×A100 | 16 req/min | 3.75s | 5.2s |
| 8×H100 | 64 req/min | 0.94s | 1.3s |
## Reproducibility Requirements

### Environment Specification
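A minimal capture script using only the standard library; the output path is an assumption:

```python
import json, platform, subprocess, sys

def snapshot_environment(path="artifacts/environment.json"):
    """Write Python, OS, and package versions next to the run's results."""
    info = {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": subprocess.run(
            [sys.executable, "-m", "pip", "freeze"],
            capture_output=True, text=True).stdout.splitlines(),
    }
    with open(path, "w") as f:
        json.dump(info, f, indent=2)
```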
### Configuration Documentation
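A small sketch that stores the fully resolved configuration alongside the code revision that produced it (paths and layout are assumptions):

```python
import json, subprocess

def snapshot_config(config: dict, path="artifacts/run_config.json"):
    """Persist the resolved config plus the exact git commit of the run."""
    record = dict(config)
    record["git_commit"] = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True).stdout.strip()
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
```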
### Artifact Preservation
- Training logs (wandb or tensorboard)
- Metric summaries (JSON format)
- Representative examples (10% sampling)
- Configuration files (complete YAML)