Overview

The reward system in ATLAS quantifies teaching effectiveness through a carefully designed function that balances performance improvement, safety guarantees, and efficiency. This design directly shapes the teacher’s behavior during reinforcement learning.

Reward Function Architecture

Core Implementation

The reward function encodes a multi-objective optimization over improvement, safety, and efficiency:
class AdaptiveTeachingReward:
    """
    Reward function for ATLAS teacher training

    Source: trainers/teacher_rewards.py
    """

    def compute_reward(self,
                      baseline_score: float,
                      enhanced_score: float,
                      teaching_length: int,
                      efficiency_weight: float = 1.0) -> float:
        """
        Calculate reward based on performance delta and efficiency

        Args:
            baseline_score: Student performance without teaching
            enhanced_score: Student performance with teaching
            teaching_length: Tokens used for guidance
            efficiency_weight: Scaling factor for efficiency bonus

        Returns:
            Reward signal for GRPO optimization
        """
        # Performance delta
        delta = enhanced_score - baseline_score

        # Safety: Zero reward for degradation
        if delta < 0:
            return 0.0

        # Efficiency bonus: Prefer concise teaching
        efficiency_bonus = 100 / (100 + teaching_length)

        # No improvement but correct: Partial reward
        if delta == 0 and enhanced_score > 0.5:
            return 0.5 * (1 + efficiency_weight * efficiency_bonus)

        # Performance improvement: Full reward
        return delta * (1 + efficiency_weight * efficiency_bonus)

Mathematical Formulation

The reward function R(s, a) can be expressed as:
R(s, a) = {
    0,                    if Δ < 0                        (degradation)
    0.5 × (1 + λ × ε),    if Δ = 0 ∧ P_enhanced > 0.5     (maintained)
    Δ × (1 + λ × ε),      if Δ > 0                        (improvement)
}

Where:
- Δ = P_enhanced - P_baseline (performance delta)
- ε = 100 / (100 + L) (efficiency factor)
- λ = efficiency weight (default: 1.0)
- L = teaching length in tokens
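
As a quick sanity check, the piecewise definition can be evaluated directly. The helper below simply restates the formula; the scores and token counts are made up for illustration:

def piecewise_reward(delta: float, enhanced: float, length: int, lam: float = 1.0) -> float:
    """Direct evaluation of R(s, a) as defined above (illustrative helper)."""
    eps = 100 / (100 + length)            # efficiency factor ε
    if delta < 0:
        return 0.0                        # degradation: zero reward
    if delta == 0 and enhanced > 0.5:
        return 0.5 * (1 + lam * eps)      # maintained correctness: partial reward
    return delta * (1 + lam * eps)        # improvement: full reward

# Δ = 0.3 with 100 guidance tokens: ε = 0.5, so R = 0.3 × 1.5 ≈ 0.45
print(piecewise_reward(delta=0.3, enhanced=0.8, length=100))   # ≈ 0.45
# Δ = 0 but the answer stays correct: R = 0.5 × 1.5 = 0.75
print(piecewise_reward(delta=0.0, enhanced=0.8, length=100))   # 0.75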

Design Principles

1. Non-Degradation Guarantee

Zero reward for performance drops ensures safety:
def validate_non_degradation(results: List[TeachingResult]) -> float:
    """
    Measure non-degradation rate across teaching interactions
    """
    non_degraded = [
        r for r in results
        if r.enhanced_score >= r.baseline_score
    ]
    return len(non_degraded) / len(results)
Empirical Result: 97% non-degradation rate

2. Efficiency Incentive

The efficiency bonus encourages concise guidance:
Teaching Length | Efficiency Bonus | Effective Multiplier
50 tokens       | 0.667            | 1.667×
100 tokens      | 0.500            | 1.500×
200 tokens      | 0.333            | 1.333×
300 tokens      | 0.250            | 1.250×
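
These values follow directly from ε = 100 / (100 + L); with λ = 1.0 the effective multiplier is simply 1 + ε. The short loop below reproduces the table:

# Reproduce the efficiency-bonus table (λ = 1.0, so multiplier = 1 + ε)
for length in (50, 100, 200, 300):
    bonus = 100 / (100 + length)
    print(f"{length} tokens: bonus = {bonus:.3f}, multiplier = {1 + bonus:.3f}x")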

3. Performance Correlation

Direct coupling between improvement and reward:
from scipy.stats import pearsonr, spearmanr, kendalltau

def analyze_reward_correlation(training_data: List[Episode]) -> Dict:
    """
    Analyze relationship between rewards and outcomes
    """
    improvements = [e.performance_delta for e in training_data]
    rewards = [e.reward for e in training_data]

    return {
        'pearson_r': pearsonr(improvements, rewards)[0],  # Expected: >0.8
        'spearman_rho': spearmanr(improvements, rewards)[0],
        'kendall_tau': kendalltau(improvements, rewards)[0]
    }

Configuration Parameters

Standard Configuration

# configs/trainer/reward/adaptive_teaching.yaml
degradation_penalty_multiplier: 2.0  # For future negative reward experiments
efficiency_weight: 1.0                # Standard efficiency scaling
max_probe_tokens: 500                 # Diagnostic probe limit
baseline_threshold: 0.5               # Minimum correctness for partial reward
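
A minimal sketch of loading this file and feeding the values into the reward class. The PyYAML-based loading and the RewardConfig dataclass are assumptions for illustration, not the project's actual configuration machinery:

# Sketch only: assumes PyYAML is installed and that a plain dataclass is enough
# to carry the reward parameters; the real trainer may use Hydra or another loader.
from dataclasses import dataclass
import yaml

@dataclass
class RewardConfig:
    degradation_penalty_multiplier: float = 2.0
    efficiency_weight: float = 1.0
    max_probe_tokens: int = 500
    baseline_threshold: float = 0.5

with open("configs/trainer/reward/adaptive_teaching.yaml") as f:
    cfg = RewardConfig(**yaml.safe_load(f))

reward_fn = AdaptiveTeachingReward()
reward = reward_fn.compute_reward(
    baseline_score=0.4,
    enhanced_score=0.7,
    teaching_length=120,
    efficiency_weight=cfg.efficiency_weight,
)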

Experimental Variations

efficiency_weight: 2.0  # Strong preference for conciseness
max_probe_tokens: 200   # Tighter token constraints
Effect: 65% reduction in average guidance length
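
The effect of a larger efficiency_weight is visible directly in the reward formula; the comparison below uses a made-up episode (Δ = 0.3, 200 guidance tokens):

# Same hypothetical episode, two efficiency weights
delta, length = 0.3, 200
eps = 100 / (100 + length)           # ε ≈ 0.333
print(delta * (1 + 1.0 * eps))       # λ = 1.0 → ≈ 0.40
print(delta * (1 + 2.0 * eps))       # λ = 2.0 → ≈ 0.50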

Reward Shaping Strategies

Progressive Curriculum

Adjust rewards during training phases:
def curriculum_reward_schedule(epoch: int, base_config: RewardConfig) -> RewardConfig:
    """
    Progressive reward shaping across training
    """
    if epoch < 10:
        # Early: Focus on safety
        return RewardConfig(
            efficiency_weight=0.5,
            degradation_penalty_multiplier=3.0
        )
    elif epoch < 20:
        # Middle: Balance all objectives
        return base_config
    else:
        # Late: Optimize efficiency
        return RewardConfig(
            efficiency_weight=1.5,
            degradation_penalty_multiplier=1.0
        )
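
A sketch of how the schedule could be applied per epoch; the training-loop wiring and the base configuration values are assumptions for illustration:

# Hypothetical wiring: refresh the reward configuration at the start of each epoch
base_config = RewardConfig(efficiency_weight=1.0, degradation_penalty_multiplier=2.0)
for epoch in range(30):
    reward_config = curriculum_reward_schedule(epoch, base_config)
    # pass reward_config to the reward object / trainer before collecting rollouts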

Domain-Specific Rewards

Customize rewards for different task types:
class DomainAdaptiveReward(AdaptiveTeachingReward):
    def compute_reward(self, task_type: str, **kwargs) -> float:
        base_reward = super().compute_reward(**kwargs)

        if task_type == "debugging":
            # Prioritize correctness over efficiency
            return base_reward * 1.2 if kwargs['enhanced_score'] > 0.9 else base_reward

        elif task_type == "reasoning":
            # Balance all factors equally
            return base_reward

        elif task_type == "coding":
            # Heavy efficiency emphasis
            efficiency_multiplier = 150 / (100 + kwargs['teaching_length'])
            return base_reward * efficiency_multiplier

Empirical Analysis

Reward Distribution

Analysis of 10,000 training episodes:
import numpy as np

def analyze_reward_distribution(episodes: List[Episode]) -> Dict:
    rewards = [e.reward for e in episodes]

    return {
        'mean': np.mean(rewards),           # 0.42
        'std': np.std(rewards),              # 0.31
        'zero_rewards': sum(r == 0 for r in rewards) / len(rewards),  # 12%
        'max_rewards': sum(r > 0.8 for r in rewards) / len(rewards),  # 18%
        'quartiles': np.percentile(rewards, [25, 50, 75])  # [0.15, 0.38, 0.61]
    }

Learning Dynamics

Reward progression during training:
Training Phase | Avg Reward | Efficiency | Non-Degradation
Epoch 1-5      | 0.23       | 0.42       | 89%
Epoch 6-10     | 0.38       | 0.58       | 94%
Epoch 11-20    | 0.51       | 0.71       | 97%
Epoch 21-30    | 0.56       | 0.82       | 97%

Optimization Impact

GRPO Integration

The reward signal directly influences policy updates:
def grpo_policy_update(self, trajectories: List[Trajectory]) -> Dict:
    """
    Update policy using reward-weighted gradients
    """
    total_loss = 0.0
    for trajectory in trajectories:
        # Compute advantages using rewards
        advantages = self.compute_advantages(trajectory.rewards)

        # Weight log probabilities by advantages
        weighted_logprobs = trajectory.logprobs * advantages

        # Accumulate the per-trajectory policy-gradient loss
        total_loss = total_loss + (-weighted_logprobs.mean())

    # Average over the batch and backpropagate; the optimizer step is assumed
    # to happen in the surrounding training loop
    loss = total_loss / len(trajectories)
    loss.backward()

    mean_reward = float(np.mean([np.mean(t.rewards) for t in trajectories]))
    return {'loss': loss.item(), 'mean_reward': mean_reward}

Convergence Analysis

Training typically converges when (a minimal check is sketched after this list):
  • Mean reward plateaus around 0.55-0.60
  • Non-degradation rate exceeds 95%
  • Efficiency bonus stabilizes at 0.7-0.8
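
A minimal check for these criteria might look like the sketch below; the per-epoch metrics dictionary and its keys are assumptions for illustration:

from typing import Dict, List

def has_converged(history: List[Dict[str, float]], window: int = 5) -> bool:
    """Check the convergence criteria above over the last `window` epochs (sketch)."""
    if len(history) < window:
        return False
    recent = history[-window:]
    rewards = [m["mean_reward"] for m in recent]
    plateaued = 0.55 <= sum(rewards) / window <= 0.60 and max(rewards) - min(rewards) < 0.02
    safe = recent[-1]["non_degradation_rate"] > 0.95
    efficient = 0.7 <= recent[-1]["efficiency_bonus"] <= 0.8
    return plateaued and safe and efficient
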
Common issues and solutions:
  • Reward hacking: Teacher provides generic advice → Add diversity penalty
  • Over-verbose teaching: Ignoring efficiency → Increase efficiency_weight
  • Under-teaching: Minimal intervention → Reduce efficiency_weight
Most sensitive parameters:
  1. efficiency_weight: ±0.5 changes behavior significantly
  2. baseline_threshold: ±0.2 affects partial reward frequency
  3. max_probe_tokens: ±200 impacts diagnosis quality

Advanced Techniques

Multi-Objective Optimization

class MultiObjectiveReward:
    """
    Extended reward with multiple objectives
    """
    def compute_reward(self, result: TeachingResult) -> float:
        objectives = {
            'performance': result.performance_delta,
            'efficiency': 100 / (100 + result.teaching_length),
            'diversity': self.compute_diversity(result.guidance),
            'correctness': result.enhanced_score,
            'safety': 1.0 if result.performance_delta >= 0 else 0.0
        }

        weights = {
            'performance': 0.35,
            'efficiency': 0.25,
            'diversity': 0.15,
            'correctness': 0.20,
            'safety': 0.05
        }

        return sum(objectives[k] * weights[k] for k in objectives)

Inverse Reinforcement Learning

Learn rewards from expert demonstrations:
import numpy as np

def learn_reward_function(expert_demos: List[Demonstration]) -> RewardFunction:
    """
    Infer reward function from expert teaching examples
    """
    features = extract_features(expert_demos)
    rewards = estimate_rewards(expert_demos)

    # Learn linear combination of features
    weights = np.linalg.lstsq(features, rewards, rcond=None)[0]

    return lambda state, action: np.dot(extract_state_features(state, action), weights)
