Overview
The teacher-student paradigm in ATLAS establishes an asymmetric learning relationship where a specialized teacher model enhances any student model’s capabilities without modifying the student’s weights or architecture.
Core Architecture
Model Roles
Technical Specifications
Model Requirements
Component | Specification | Purpose |
---|---|---|
Teacher Model | 8B parameters, RL-trained | Provides adaptive guidance |
Student Model | Any size (4B-70B+) | Executes enhanced reasoning |
Context Window | 4096-32768 tokens | Accommodates teaching interaction |
Inference Time | +30% overhead | Two-pass protocol cost |
Capability Assessment
The teacher evaluates student competence through targeted probes.
Adaptive Teaching Strategies
Strategy Selection Matrix
Student Capability | Teaching Strategy | Guidance Tokens | Focus |
---|---|---|---|
WEAK | Comprehensive scaffolding | 200-300 | Step-by-step decomposition |
MODERATE | Targeted hints | 100-150 | Key insights and corrections |
STRONG | Minimal intervention | 50-100 | Edge case handling only |
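The matrix above can be read as a lookup from the assessed capability level to a strategy and a guidance-token budget. The sketch below is illustrative only: `Capability`, `TeachingStrategy`, and `select_strategy` are hypothetical names, not part of the ATLAS API. Only the levels, strategy labels, token ranges, and focus descriptions are taken from the table.

```python
from dataclasses import dataclass
from enum import Enum


class Capability(Enum):
    """Capability levels named in the strategy selection matrix."""
    WEAK = "weak"
    MODERATE = "moderate"
    STRONG = "strong"


@dataclass(frozen=True)
class TeachingStrategy:
    name: str
    min_tokens: int  # lower bound of the guidance-token range
    max_tokens: int  # upper bound of the guidance-token range
    focus: str


# Values copied from the matrix; the data structure itself is an assumption.
STRATEGY_MATRIX = {
    Capability.WEAK: TeachingStrategy(
        "Comprehensive scaffolding", 200, 300, "Step-by-step decomposition"),
    Capability.MODERATE: TeachingStrategy(
        "Targeted hints", 100, 150, "Key insights and corrections"),
    Capability.STRONG: TeachingStrategy(
        "Minimal intervention", 50, 100, "Edge case handling only"),
}


def select_strategy(capability: Capability) -> TeachingStrategy:
    """Return the teaching strategy for an already-assessed capability level."""
    return STRATEGY_MATRIX[capability]
```

For example, `select_strategy(Capability.STRONG).max_tokens` caps guidance at 100 tokens, so stronger students receive shorter guidance.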
Implementation Example
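A minimal sketch of the interaction, assuming the diagnostic response, teacher guidance, enhanced response flow shown in the case study on this page. The `Model` type, the `teach` function, the prompt wording, and both stub models are hypothetical placeholders for illustration and do not reflect the actual ATLAS interface.

```python
from typing import Callable, Dict

# Hypothetical type: any function that maps a prompt string to a completion string.
Model = Callable[[str], str]


def teach(task: str, student: Model, teacher: Model) -> Dict[str, str]:
    """Run diagnostic -> guidance -> enhanced response.

    The student is only prompted; its weights are never touched.
    """
    # The student makes a first attempt so the teacher can gauge its competence.
    diagnostic = student(f"Task: {task}\nDescribe your approach briefly.")
    # The teacher reads the attempt and writes guidance for this student.
    guidance = teacher(
        f"Task: {task}\nStudent attempt: {diagnostic}\nWrite a short hint.")
    # The student answers again with the guidance placed in its prompt.
    enhanced = student(f"Task: {task}\nGuidance: {guidance}\nAnswer:")
    return {"diagnostic": diagnostic, "guidance": guidance, "enhanced": enhanced}


# Stub models so the sketch runs without any model backend.
def stub_student(prompt: str) -> str:
    return "guided answer" if "Guidance:" in prompt else "first attempt"


def stub_teacher(prompt: str) -> str:
    return "Count the doubling periods before multiplying."


result = teach("How many bacteria after 15 hours?", stub_student, stub_teacher)
```

In a real deployment the two callables would wrap calls to the teacher and student models; the control flow stays the same.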
Empirical Performance
τ²-bench Results (Dual-Control Environment)
Our system was evaluated on τ²-bench’s most complex mms_issue tasks, establishing state-of-the-art performance:
System | Pass@1 Rate | Pass@4 Rate | Notes |
---|---|---|---|
ATLAS Teacher-Student | 24.0% | 22.4% | Minimal degradation |
GPT-4.1 | 18.0% | 10.0% | -8pt drop |
Claude 3.7 Sonnet | 18.0% | 2.0% | -16pt drop |
o4-mini | 12.0% | 2.0% | -10pt drop |
Qwen3-8B (Student Only) | 4.1% | - | No teacher guidance |
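The Notes column reports the change from the Pass@1 rate to the Pass@4 rate in percentage points. The short check below recomputes those differences, along with the relative lift over the unguided student, using only numbers copied from the table. The variable and function names are illustrative.

```python
# Rates copied from the results table: (Pass@1 %, Pass@4 %).
RESULTS = {
    "ATLAS Teacher-Student": (24.0, 22.4),
    "GPT-4.1": (18.0, 10.0),
    "Claude 3.7 Sonnet": (18.0, 2.0),
    "o4-mini": (12.0, 2.0),
}


def degradation(pass_at_1: float, pass_at_4: float) -> float:
    """Change in percentage points from Pass@1 to Pass@4 (negative means a drop)."""
    return round(pass_at_4 - pass_at_1, 1)


drops = {system: degradation(p1, p4) for system, (p1, p4) in RESULTS.items()}

# Relative lift of the guided student over the Qwen3-8B student-only baseline.
lift = round(24.0 / 4.1, 2)
```

The ATLAS drop is 1.6 points, against 8 to 16 points for the other systems. The lift evaluates to about 5.85, which corresponds to the 6x figure cited under Key Observations.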
Key Performance Metrics (from README)
- Average accuracy improvement: 15.7% across tasks
- Maximum improvement: 29.6% on specific domains
- Completion rate: 31 percentage-point improvement (69% → 100%)
- Token efficiency: 50% reduction (4k → 2k tokens)
- Non-degradation rate: 97%
Key Observations
- Roughly 6x performance lift: Teacher guidance improves Qwen3-8B from 4.1% to 24.0%
- Consistency advantage: Minimal pass@4 degradation vs competitors
- Cross-domain transfer: Math-trained teacher successfully guides telecom tasks
Case Study: Mathematical Reasoning
Demonstrating teacher-student interaction on a complex problem:
Task
“A bacteria culture doubles every 3 hours. Starting with 100 bacteria, how many will there be after 15 hours?”
Interaction Flow
Diagnostic Response: “100 × 2 × 5 = 1000”
Teacher Guidance:
Enhanced Response: “3200 bacteria (100 × 2^5)”
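The diagnostic response multiplies by 2 and then by the number of periods, which treats the growth as linear. Doubling compounds: 15 hours at one doubling every 3 hours is 5 doublings, so the count is 100 × 2^5. The arithmetic can be checked with a few lines of Python (the function name is illustrative):

```python
def population(initial: float, doubling_hours: float, elapsed_hours: float) -> float:
    """Population after exponential growth with a fixed doubling time."""
    doublings = elapsed_hours / doubling_hours  # 15 / 3 = 5 doubling periods
    return initial * 2 ** doublings


correct = population(100, 3, 15)    # 100 * 2**5
linear_mistake = 100 * 2 * (15 // 3)  # the diagnostic response's calculation
```

Here `correct` is 3200 and `linear_mistake` is 1000, matching the enhanced and diagnostic responses respectively.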
Integration Patterns
Pattern 1: Direct Enhancement
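One way to picture direct enhancement is a wrapper that turns a plain student into a teacher-guided one. This is a simplified sketch under stated assumptions: it omits the diagnostic probe, and `Model`, `with_teacher`, the prompt layout, and the stub models are all hypothetical rather than the ATLAS interface.

```python
from typing import Callable

# Hypothetical type: any function that maps a prompt string to a completion string.
Model = Callable[[str], str]


def with_teacher(student: Model, teacher: Model) -> Model:
    """Wrap a frozen student so every call is preceded by teacher guidance."""
    def enhanced(task: str) -> str:
        guidance = teacher(task)
        # Guidance is injected through the prompt only; the student is unchanged.
        return student(f"Guidance: {guidance}\nTask: {task}")
    return enhanced


# Stub models so the sketch runs without any model backend.
def echo_student(prompt: str) -> str:
    return prompt  # returns its prompt so the injected guidance is visible


def stub_teacher(task: str) -> str:
    return "Check the units."


guided_student = with_teacher(echo_student, stub_teacher)
answer = guided_student("Convert 3 km to miles.")
```

Because the wrapper returns an ordinary callable, existing code that calls the student can switch to the guided version without other changes.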
Pattern 2: Batch Processing
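For batch workloads, guidance can be generated for all tasks in one teacher call and then paired with the tasks for one student call. The sketch below assumes both models accept and return lists of strings. `BatchModel`, `teach_batch`, and the stubs are hypothetical names used only for illustration.

```python
from typing import Callable, List

# Hypothetical type: a model that maps a list of prompts to a list of completions.
BatchModel = Callable[[List[str]], List[str]]


def teach_batch(tasks: List[str], student: BatchModel, teacher: BatchModel) -> List[str]:
    """Generate guidance for every task, then answer all tasks in one student call."""
    guidance = teacher(tasks)
    if len(guidance) != len(tasks):
        raise ValueError("teacher must return one guidance string per task")
    prompts = [f"Guidance: {g}\nTask: {t}" for g, t in zip(guidance, tasks)]
    return student(prompts)


# Stub models so the sketch runs without any model backend.
def stub_teacher_batch(tasks: List[str]) -> List[str]:
    return [f"hint for {task}" for task in tasks]


def stub_student_batch(prompts: List[str]) -> List[str]:
    # Echo the last prompt line (the task) to show that ordering is preserved.
    return [f"answer to: {prompt.splitlines()[-1]}" for prompt in prompts]


answers = teach_batch(["a", "b"], stub_student_batch, stub_teacher_batch)
```

The length check guards against a teacher backend that silently drops or merges items, which would otherwise misalign guidance and tasks.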
Pattern 3: Streaming Applications
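In a streaming setup, the guidance has to be complete before the student starts, since it is part of the student's prompt; only the student's output is streamed. A minimal sketch using a Python generator follows. `stream_enhanced` and the stubs are hypothetical, and a real backend would yield tokens from the model instead of a fixed list.

```python
from typing import Callable, Iterator


def stream_enhanced(
    task: str,
    teacher: Callable[[str], str],
    student_stream: Callable[[str], Iterator[str]],
) -> Iterator[str]:
    """Yield student tokens as they are produced, after guidance is ready."""
    # The teacher call blocks: the student cannot start without its guidance.
    guidance = teacher(task)
    prompt = f"Guidance: {guidance}\nTask: {task}"
    # Forward each student token to the caller as soon as it is available.
    yield from student_stream(prompt)


# Stub models so the sketch runs without any model backend.
def stub_teacher(task: str) -> str:
    return "Count the doubling periods."


def stub_student_stream(prompt: str) -> Iterator[str]:
    for token in ["3200", " ", "bacteria"]:
        yield token


tokens = list(stream_enhanced("How many bacteria?", stub_teacher, stub_student_stream))
```

The time to first token therefore includes the full teacher call, which is worth keeping in mind for interactive applications.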
Advantages
Over Fine-tuning
- No retraining required: Works with frozen student models
- Preserves capabilities: No catastrophic forgetting
- Instant deployment: No training time or cost
Over Prompting
- Adaptive: Adjusts to student capability
- Consistent: Systematic improvement approach
- Efficient: Optimized token usage
Over Ensemble Methods
- Lower latency: Single student inference
- Lower cost: No multiple model calls
- Better interpretability: Clear teaching rationale
Implementation Best Practices
Teacher Model Selection
Choose based on task type:
- ATLAS-8B-Thinking: Mathematical and logical reasoning
- ATLAS-8B-Instruct: Code generation and technical tasks
- Custom trained: Domain-specific requirements
Student Model Compatibility
Verify student model supports:
- System prompts or instruction following
- Sufficient context length (at least 4K tokens)
- Deterministic generation (temperature control)
Performance Optimization
- Cache teacher guidance for repeated queries
- Batch similar tasks together
- Use streaming for interactive applications
- Monitor token usage for cost control
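Caching teacher guidance for repeated queries can be sketched with an in-memory dictionary. Because guidance adapts to the student, this sketch keys the cache on both a student identifier and the task text. `GuidanceCache` and its methods are hypothetical names, and a production deployment would likely need eviction and persistence as well.

```python
import hashlib
from typing import Callable, Dict


class GuidanceCache:
    """In-memory cache of teacher guidance keyed by (student, task)."""

    def __init__(self, teacher: Callable[[str], str]) -> None:
        self._teacher = teacher
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(student_id: str, task: str) -> str:
        # Collapse whitespace so trivially different copies of a task share a key.
        normalized = " ".join(task.split())
        raw = f"{student_id}\x00{normalized}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, student_id: str, task: str) -> str:
        """Return cached guidance, calling the teacher only on a miss."""
        key = self._key(student_id, task)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._teacher(task)
        return self._store[key]
```

A repeated query for the same student then costs a dictionary lookup instead of a teacher inference, while a different student asking the same question still gets its own guidance.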
Next Steps
- Adaptive Teaching Protocol: Detailed two-pass protocol mechanics
- Integration Guide: Deploy teacher-student system
- API Reference: Configuration reference
- Performance Results: Detailed benchmarks
References
- ATLAS Technical Report - Complete methodology and architecture
- First Experiment - Hands-on teacher training
- τ²-bench Results - State-of-the-art performance