Overview
The reward system in ATLAS quantifies teaching effectiveness through a carefully designed function that balances performance improvement, safety guarantees, and efficiency. This design directly shapes the teacher's behavior during reinforcement learning.

Reward Function Architecture
Core Implementation
The reward function implements a multi-objective optimization.

Mathematical Formulation
The reward function R(s, a) combines the student's performance improvement with a non-degradation gate and an efficiency multiplier.

Design Principles
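A compact formulation consistent with the principles enumerated in this section, as a hedged reconstruction: Δ denotes the student's score improvement over its unassisted baseline, L the teaching length in tokens, and the constant 100 is an assumption inferred from the efficiency table in this page.

```latex
R(s, a) =
\begin{cases}
0, & \Delta \le 0 \quad \text{(non-degradation guarantee)} \\[4pt]
\Delta \cdot \Bigl(1 + \dfrac{100}{100 + L}\Bigr), & \Delta > 0 \quad \text{(improvement scaled by efficiency)}
\end{cases}
```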
1. Non-Degradation Guarantee
Zero reward for performance drops ensures safety: the teacher is never rewarded for guidance that makes the student worse than its baseline.

2. Efficiency Incentive
The efficiency bonus encourages concise guidance:

| Teaching Length | Efficiency Bonus | Effective Multiplier |
|---|---|---|
| 50 tokens | 0.667 | 1.667× |
| 100 tokens | 0.500 | 1.500× |
| 200 tokens | 0.333 | 1.333× |
| 300 tokens | 0.250 | 1.250× |
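The non-degradation gate and the bonus schedule above combine into a short sketch. The function name and signature are illustrative rather than the actual ATLAS implementation, and the bonus formula `100 / (100 + tokens)` is an assumption chosen because it reproduces every row of the table.

```python
def compute_reward(baseline_score: float, taught_score: float,
                   teaching_tokens: int) -> float:
    """Illustrative reward: improvement scaled by an efficiency multiplier.

    Returns zero whenever the student's performance drops (non-degradation
    guarantee); otherwise the improvement is scaled by (1 + efficiency
    bonus), so shorter teaching earns a larger multiplier.
    """
    improvement = taught_score - baseline_score
    if improvement <= 0:
        return 0.0  # non-degradation: never reward a performance drop
    efficiency_bonus = 100 / (100 + teaching_tokens)  # 0.667 at 50 tokens
    return improvement * (1 + efficiency_bonus)
```

For example, a +0.30 improvement achieved with 100 tokens of teaching yields 0.30 × 1.500 = 0.45.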
3. Performance Correlation
Direct coupling between improvement and reward: larger student gains earn proportionally larger rewards.

Configuration Parameters
Standard Configuration
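A plausible shape for the standard configuration, using the parameter names that appear in the sensitivity analysis later on this page; every default value shown here is an illustrative assumption, not a shipped default.

```python
# Illustrative defaults -- assumed values, not the shipped ATLAS configuration.
REWARD_CONFIG = {
    "efficiency_weight": 1.0,    # scales the efficiency bonus term
    "baseline_threshold": 0.5,   # baseline score gate for partial rewards
    "max_probe_tokens": 500,     # cap on diagnostic probing length
    "non_degradation": True,     # zero reward when performance drops
}
```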
Experimental Variations
Reward Shaping Strategies
Progressive Curriculum
Adjust rewards during training phases.

Domain-Specific Rewards
Customize rewards for different task types.

Empirical Analysis
Reward Distribution
Analysis of 10,000 training episodes.

Learning Dynamics
Reward progression during training:

| Training Phase | Avg Reward | Efficiency | Non-Degradation |
|---|---|---|---|
| Epoch 1-5 | 0.23 | 0.42 | 89% |
| Epoch 6-10 | 0.38 | 0.58 | 94% |
| Epoch 11-20 | 0.51 | 0.71 | 97% |
| Epoch 21-30 | 0.56 | 0.82 | 97% |
Optimization Impact
GRPO Integration
The reward signal directly influences policy updates: GRPO normalizes each episode's reward against its sampled group to form the advantage that weights the policy gradient.

Convergence Analysis
Reward Convergence
Training typically converges when:
- Mean reward plateaus around 0.55-0.60
- Non-degradation rate exceeds 95%
- Efficiency bonus stabilizes at 0.7-0.8
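The three criteria above can be checked mechanically. A minimal sketch, with thresholds taken directly from the list and a function name that is purely illustrative:

```python
def has_converged(mean_reward: float, non_degradation_rate: float,
                  efficiency_bonus: float) -> bool:
    """True when all three convergence criteria from the docs are met."""
    return (0.55 <= mean_reward <= 0.60          # reward plateau band
            and non_degradation_rate > 0.95      # safety criterion
            and 0.70 <= efficiency_bonus <= 0.80)  # stable efficiency
```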
Failure Modes
Common issues and solutions:
- Reward hacking: Teacher provides generic advice → Add diversity penalty
- Over-verbose teaching: Ignoring efficiency → Increase efficiency_weight
- Under-teaching: Minimal intervention → Reduce efficiency_weight
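For the reward-hacking failure mode, one illustrative form of a diversity penalty discounts teaching that closely repeats recent advice. The Jaccard token-overlap measure and the penalty weight below are assumptions for illustration, not the ATLAS mechanism:

```python
def diversity_penalty(teaching: str, recent: list[str],
                      weight: float = 0.5) -> float:
    """Penalty grows with the max token-set overlap against recent teachings."""
    tokens = set(teaching.split())
    if not tokens or not recent:
        return 0.0

    def jaccard(other: str) -> float:
        other_tokens = set(other.split())
        union = tokens | other_tokens
        return len(tokens & other_tokens) / len(union) if union else 0.0

    return weight * max(jaccard(r) for r in recent)
```

The penalty would be subtracted from the base reward, so verbatim-repeated generic advice loses up to `weight` of reward while novel guidance is untouched.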
Hyperparameter Sensitivity
Most sensitive parameters:
- efficiency_weight: ±0.5 changes behavior significantly
- baseline_threshold: ±0.2 affects partial reward frequency
- max_probe_tokens: ±200 impacts diagnosis quality
Advanced Techniques
Multi-Objective Optimization
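One common way to combine the competing objectives is weighted scalarization: collapse improvement, efficiency, and safety scores into a single scalar for the optimizer. The weights and function below are an illustrative sketch, not the ATLAS implementation:

```python
def scalarized_reward(improvement: float, efficiency: float, safety: float,
                      weights: tuple[float, float, float] = (0.6, 0.3, 0.1),
                      ) -> float:
    """Weighted sum over three normalized objectives (weights are assumed)."""
    w_imp, w_eff, w_safe = weights
    return w_imp * improvement + w_eff * efficiency + w_safe * safety
```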
Inverse Reinforcement Learning
Learn rewards from expert demonstrations.

Next Steps
- RL Training Guide: How rewards guide GRPO optimization
- Performance Benchmarks: Reward impact on final metrics
- Custom Rewards: Implement domain-specific rewards
- Technical Report: Theoretical foundations
References
- ATLAS Technical Report - Section 3.3 on reward design
- GRPO Paper - Policy gradient foundations
- Adaptive Teaching Protocol - How rewards shape teaching behavior