Overview

The reward system in ATLAS quantifies teaching effectiveness through a carefully designed function that balances performance improvement, safety guarantees, and efficiency. This design directly shapes the teacher’s behavior during reinforcement learning.

Reward Function Architecture

Core Implementation

The reward function encodes a multi-objective optimization over improvement, safety, and efficiency:
class AdaptiveTeachingReward:
    """
    Reward function for ATLAS teacher training

    Source: trainers/teacher_rewards.py
    """

    def compute_reward(self,
                      baseline_score: float,
                      enhanced_score: float,
                      teaching_length: int,
                      efficiency_weight: float = 1.0) -> float:
        """
        Calculate reward based on performance delta and efficiency

        Args:
            baseline_score: Student performance without teaching
            enhanced_score: Student performance with teaching
            teaching_length: Tokens used for guidance
            efficiency_weight: Scaling factor for efficiency bonus

        Returns:
            Reward signal for GRPO optimization
        """
        # Performance delta
        delta = enhanced_score - baseline_score

        # Safety: Zero reward for degradation
        if delta < 0:
            return 0.0

        # Efficiency bonus: Prefer concise teaching
        efficiency_bonus = 100 / (100 + teaching_length)

        # No improvement but correct: Partial reward
        if delta == 0 and enhanced_score > 0.5:
            return 0.5 * (1 + efficiency_weight * efficiency_bonus)

        # Performance improvement: Full reward
        return delta * (1 + efficiency_weight * efficiency_bonus)

Mathematical Formulation

The reward function R(s, a) can be expressed as:
R(s, a) = {
    0,                    if Δ < 0                        (degradation)
    0.5 × (1 + λ × ε),    if Δ = 0 ∧ P_enhanced > 0.5     (maintained)
    Δ × (1 + λ × ε),      if Δ > 0                        (improvement)
}

Where:
- Δ = P_enhanced - P_baseline (performance delta)
- ε = 100 / (100 + L) (efficiency factor)
- λ = efficiency weight (default: 1.0)
- L = teaching length in tokens
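
As a quick sanity check, the piecewise definition can be evaluated directly. The helper below simply restates the formula; the scores and token counts are made up for illustration:

def piecewise_reward(delta: float, enhanced: float, length: int, lam: float = 1.0) -> float:
    """Direct evaluation of R(s, a) as defined above (illustrative helper)."""
    eps = 100 / (100 + length)            # efficiency factor ε
    if delta < 0:
        return 0.0                        # degradation: zero reward
    if delta == 0 and enhanced > 0.5:
        return 0.5 * (1 + lam * eps)      # maintained correctness: partial reward
    return delta * (1 + lam * eps)        # improvement: full reward

# Δ = 0.3 with 100 guidance tokens: ε = 0.5, so R = 0.3 × 1.5 ≈ 0.45
print(piecewise_reward(delta=0.3, enhanced=0.8, length=100))   # ≈ 0.45
# Δ = 0 but the answer stays correct: R = 0.5 × 1.5 = 0.75
print(piecewise_reward(delta=0.0, enhanced=0.8, length=100))   # 0.75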

Design Principles

1. Non-Degradation Guarantee

Zero reward for performance drops ensures safety:
def validate_non_degradation(results: List[TeachingResult]) -> float:
    """
    Measure non-degradation rate across teaching interactions
    """
    non_degraded = [
        r for r in results
        if r.enhanced_score >= r.baseline_score
    ]
    return len(non_degraded) / len(results)
Empirical Result: 97% non-degradation rate

2. Efficiency Incentive

The efficiency bonus encourages concise guidance:
Teaching Length | Efficiency Bonus | Effective Multiplier
50 tokens       | 0.667            | 1.667×
100 tokens      | 0.500            | 1.500×
200 tokens      | 0.333            | 1.333×
300 tokens      | 0.250            | 1.250×
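
These values follow directly from ε = 100 / (100 + L); with λ = 1.0 the effective multiplier is simply 1 + ε. The short loop below reproduces the table:

# Reproduce the efficiency-bonus table (λ = 1.0, so multiplier = 1 + ε)
for length in (50, 100, 200, 300):
    bonus = 100 / (100 + length)
    print(f"{length} tokens: bonus = {bonus:.3f}, multiplier = {1 + bonus:.3f}x")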

3. Performance Correlation

Direct coupling between improvement and reward:
from scipy.stats import pearsonr, spearmanr, kendalltau

def analyze_reward_correlation(training_data: List[Episode]) -> Dict:
    """
    Analyze relationship between rewards and outcomes
    """
    improvements = [e.performance_delta for e in training_data]
    rewards = [e.reward for e in training_data]

    return {
        'pearson_r': pearsonr(improvements, rewards)[0],  # Expected: >0.8
        'spearman_rho': spearmanr(improvements, rewards)[0],
        'kendall_tau': kendalltau(improvements, rewards)[0]
    }

Configuration Parameters

Standard Configuration

# configs/trainer/reward/adaptive_teaching.yaml
degradation_penalty_multiplier: 2.0  # For future negative reward experiments
efficiency_weight: 1.0                # Standard efficiency scaling
max_probe_tokens: 500                 # Diagnostic probe limit
baseline_threshold: 0.5               # Minimum correctness for partial reward
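
A minimal sketch of loading this file and feeding the values into the reward class. The PyYAML-based loading and the RewardConfig dataclass are assumptions for illustration, not the project's actual configuration machinery:

# Sketch only: assumes PyYAML is installed and that a plain dataclass is enough
# to carry the reward parameters; the real trainer may use Hydra or another loader.
from dataclasses import dataclass
import yaml

@dataclass
class RewardConfig:
    degradation_penalty_multiplier: float = 2.0
    efficiency_weight: float = 1.0
    max_probe_tokens: int = 500
    baseline_threshold: float = 0.5

with open("configs/trainer/reward/adaptive_teaching.yaml") as f:
    cfg = RewardConfig(**yaml.safe_load(f))

reward_fn = AdaptiveTeachingReward()
reward = reward_fn.compute_reward(
    baseline_score=0.4,
    enhanced_score=0.7,
    teaching_length=120,
    efficiency_weight=cfg.efficiency_weight,
)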

Experimental Variations

efficiency_weight: 2.0  # Strong preference for conciseness
max_probe_tokens: 200   # Tighter token constraints
Effect: 65% reduction in average guidance length
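
The effect of a larger efficiency_weight is visible directly in the reward formula; the comparison below uses a made-up episode (Δ = 0.3, 200 guidance tokens):

# Same hypothetical episode, two efficiency weights
delta, length = 0.3, 200
eps = 100 / (100 + length)           # ε ≈ 0.333
print(delta * (1 + 1.0 * eps))       # λ = 1.0 → ≈ 0.40
print(delta * (1 + 2.0 * eps))       # λ = 2.0 → ≈ 0.50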

Reward Shaping Strategies

Progressive Curriculum

Adjust rewards during training phases:
def curriculum_reward_schedule(epoch: int, base_config: RewardConfig) -> RewardConfig:
    """
    Progressive reward shaping across training
    """
    if epoch < 10:
        # Early: Focus on safety
        return RewardConfig(
            efficiency_weight=0.5,
            degradation_penalty_multiplier=3.0
        )
    elif epoch < 20:
        # Middle: Balance all objectives
        return base_config
    else:
        # Late: Optimize efficiency
        return RewardConfig(
            efficiency_weight=1.5,
            degradation_penalty_multiplier=1.0
        )
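
A sketch of how the schedule could be applied per epoch; the training-loop wiring and the base configuration values are assumptions for illustration:

# Hypothetical wiring: refresh the reward configuration at the start of each epoch
base_config = RewardConfig(efficiency_weight=1.0, degradation_penalty_multiplier=2.0)
for epoch in range(30):
    reward_config = curriculum_reward_schedule(epoch, base_config)
    # pass reward_config to the reward object / trainer before collecting rollouts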

Domain-Specific Rewards

Customize rewards for different task types:
class DomainAdaptiveReward(AdaptiveTeachingReward):
    def compute_reward(self, task_type: str, **kwargs) -> float:
        base_reward = super().compute_reward(**kwargs)

        if task_type == "debugging":
            # Prioritize correctness over efficiency
            return base_reward * 1.2 if kwargs['enhanced_score'] > 0.9 else base_reward

        elif task_type == "reasoning":
            # Balance all factors equally
            return base_reward

        elif task_type == "coding":
            # Heavy efficiency emphasis
            efficiency_multiplier = 150 / (100 + kwargs['teaching_length'])
            return base_reward * efficiency_multiplier

Empirical Analysis

Reward Distribution

Analysis of 10,000 training episodes:
import numpy as np

def analyze_reward_distribution(episodes: List[Episode]) -> Dict:
    rewards = [e.reward for e in episodes]

    return {
        'mean': np.mean(rewards),           # 0.42
        'std': np.std(rewards),              # 0.31
        'zero_rewards': sum(r == 0 for r in rewards) / len(rewards),  # 12%
        'max_rewards': sum(r > 0.8 for r in rewards) / len(rewards),  # 18%
        'quartiles': np.percentile(rewards, [25, 50, 75])  # [0.15, 0.38, 0.61]
    }

Learning Dynamics

Reward progression during training:
Training Phase | Avg Reward | Efficiency | Non-Degradation
Epoch 1-5      | 0.23       | 0.42       | 89%
Epoch 6-10     | 0.38       | 0.58       | 94%
Epoch 11-20    | 0.51       | 0.71       | 97%
Epoch 21-30    | 0.56       | 0.82       | 97%

Optimization Impact

GRPO Integration

The reward signal directly influences policy updates:
def grpo_policy_update(self, trajectories: List[Trajectory]) -> Dict:
    """
    Update policy using reward-weighted gradients
    """
    total_loss = 0.0
    for trajectory in trajectories:
        # Compute advantages using rewards
        advantages = self.compute_advantages(trajectory.rewards)

        # Weight log probabilities by advantages
        weighted_logprobs = trajectory.logprobs * advantages

        # Accumulate the per-trajectory policy-gradient loss
        total_loss = total_loss + (-weighted_logprobs.mean())

    # Average over the batch and backpropagate; the optimizer step is assumed
    # to happen in the surrounding training loop
    loss = total_loss / len(trajectories)
    loss.backward()

    mean_reward = float(np.mean([np.mean(t.rewards) for t in trajectories]))
    return {'loss': loss.item(), 'mean_reward': mean_reward}

Convergence Analysis

Training typically converges when (a minimal check is sketched after this list):
  • Mean reward plateaus around 0.55-0.60
  • Non-degradation rate exceeds 95%
  • Efficiency bonus stabilizes at 0.7-0.8
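
A minimal check for these criteria might look like the sketch below; the per-epoch metrics dictionary and its keys are assumptions for illustration:

from typing import Dict, List

def has_converged(history: List[Dict[str, float]], window: int = 5) -> bool:
    """Check the convergence criteria above over the last `window` epochs (sketch)."""
    if len(history) < window:
        return False
    recent = history[-window:]
    rewards = [m["mean_reward"] for m in recent]
    plateaued = 0.55 <= sum(rewards) / window <= 0.60 and max(rewards) - min(rewards) < 0.02
    safe = recent[-1]["non_degradation_rate"] > 0.95
    efficient = 0.7 <= recent[-1]["efficiency_bonus"] <= 0.8
    return plateaued and safe and efficient
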
Common issues and solutions:
  • Reward hacking: Teacher provides generic advice → Add diversity penalty
  • Over-verbose teaching: Ignoring efficiency → Increase efficiency_weight
  • Under-teaching: Minimal intervention → Reduce efficiency_weight
Most sensitive parameters:
  1. efficiency_weight: ±0.5 changes behavior significantly
  2. baseline_threshold: ±0.2 affects partial reward frequency
  3. max_probe_tokens: ±200 impacts diagnosis quality

Advanced Techniques

Multi-Objective Optimization

class MultiObjectiveReward:
    """
    Extended reward with multiple objectives
    """
    def compute_reward(self, result: TeachingResult) -> float:
        objectives = {
            'performance': result.performance_delta,
            'efficiency': 100 / (100 + result.teaching_length),
            'diversity': self.compute_diversity(result.guidance),
            'correctness': result.enhanced_score,
            'safety': 1.0 if result.performance_delta >= 0 else 0.0
        }

        weights = {
            'performance': 0.35,
            'efficiency': 0.25,
            'diversity': 0.15,
            'correctness': 0.20,
            'safety': 0.05
        }

        return sum(objectives[k] * weights[k] for k in objectives)

Inverse Reinforcement Learning

Learn rewards from expert demonstrations:
import numpy as np

def learn_reward_function(expert_demos: List[Demonstration]) -> RewardFunction:
    """
    Infer reward function from expert teaching examples
    """
    features = extract_features(expert_demos)
    rewards = estimate_rewards(expert_demos)

    # Learn linear combination of features
    weights = np.linalg.lstsq(features, rewards, rcond=None)[0]

    return lambda state, action: np.dot(extract_state_features(state, action), weights)
