How do we know if teaching actually worked? Instead of relying on a single reward model that can be biased or brittle, ATLAS scores every interaction with a multi-agent ensemble of AI judges. Think of it like a medical panel: a team of general practitioners makes an initial diagnosis, and when they disagree, a specialist makes the final call. This design achieves 93.7% accuracy on RewardBench V2 while keeping evaluation costs low.

The Two-Tier System

ATLAS Reward System Architecture

Tier 1: Fast ensemble evaluation → Tier 2: Expert arbiter when needed

How It Works

Tier 1: The Initial Team
  • Multiple efficient models (like gemini-2.5-flash) run in parallel
  • Each runs at different temperatures for diverse perspectives
  • They all score the same interaction independently
  • Fast and cheap for most cases
Tier 2: The Expert Arbiter
  • Only called when the team disagrees (high variance in scores)
  • Or when any judge reports low confidence
  • A more powerful model (like gemini-2.5-pro) reviews everything
  • Makes the final decision with full context
The key insight: Most cases are clear-cut and don’t need the expensive expert. When there’s genuine ambiguity, escalate to the specialist.
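As a rough sketch, the Tier 1 fan-out can be pictured like this (assuming a judge_once(inputs, temperature) helper that wraps one call to the small judge model and returns a dict with score and uncertainty; the helper and names are illustrative, not the ATLAS API):
from concurrent.futures import ThreadPoolExecutor

TEMPERATURES = [0.2, 0.5, 0.8]  # one ensemble member per temperature

def run_tier1(inputs, judge_once):
    # Fan the same interaction out to the ensemble in parallel and
    # collect each member's independent judgment.
    with ThreadPoolExecutor(max_workers=len(TEMPERATURES)) as pool:
        futures = [pool.submit(judge_once, inputs, t) for t in TEMPERATURES]
        return [f.result() for f in futures]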

When Escalation Happens

The system escalates to Tier 2 when either:
  • High disagreement: Standard deviation of scores exceeds the threshold (default: 0.15)
  • Low confidence: Any judge reports uncertainty above the threshold (default: 0.3)
Otherwise, it uses the most confident judgment from Tier 1—saving both time and money.
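A minimal sketch of that decision rule, using the default thresholds from the configuration shown later (the function names are illustrative):
from statistics import pstdev

def needs_arbiter(samples, variance_threshold=0.15, uncertainty_threshold=0.3):
    # Escalate when the ensemble disagrees or any member is unsure.
    scores = [s["score"] for s in samples]
    if pstdev(scores) > variance_threshold:  # high disagreement
        return True
    return any(s["uncertainty"] > uncertainty_threshold for s in samples)  # low confidence

def tier1_verdict(samples):
    # Otherwise keep the most confident Tier 1 judgment.
    return min(samples, key=lambda s: s["uncertainty"])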

Session-Level Evaluation

The reward system evaluates the complete trajectory after execution finishes. It derives 2-3 weighted principles tailored to the specific session, scores the trajectory against those principles, and extracts behavioral patterns for future learning.

How It Works

The evaluator receives the full session context—task, plan, steps, final answer, execution mode—and generates a structured evaluation:
{
  "principles": [
    {"name": "Correctness", "weight": 0.5, "description": "Final deliverable matches requirements"},
    {"name": "Safety", "weight": 0.3, "description": "No policy violations detected"},
    {"name": "Efficiency", "weight": 0.2, "description": "Minimal retries needed"}
  ],
  "score": 0.85,
  "rationale": "Response solves the task correctly with efficient execution",
  "uncertainty": 0.1,
  "student_learning": "For straightforward tasks, proceed directly to solution without exploratory steps",
  "teacher_learning": null
}
Key components:
  • Principles: Domain-relevant evaluation criteria with weights (sum to 1.0)
  • Score: Aggregated result in [0.0, 1.0] range
  • Rationale: Explanation grounded in the principles
  • Student learning: Cross-domain behavioral pattern to remember (not task-specific content)
  • Teacher learning: Pedagogical strategy that worked (when teacher provided guidance)
This makes every score fully auditable—you can see which principles were applied and why the judgment was made.
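For example, a small audit helper could check the invariants described above (field names follow the sample payload; the helper itself is illustrative):
def audit_evaluation(evaluation: dict) -> None:
    # Principle weights must sum to 1.0 and the score must stay in [0.0, 1.0].
    weights = [p["weight"] for p in evaluation["principles"]]
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError(f"principle weights sum to {sum(weights)}, expected 1.0")
    if not 0.0 <= evaluation["score"] <= 1.0:
        raise ValueError("score must be in [0.0, 1.0]")
    # Surface the audit trail: which principles drove the judgment.
    for p in evaluation["principles"]:
        print(f"{p['name']} (weight {p['weight']}): {p['description']}")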

Defining Domain Objectives

The judge prompt system lets you express quality criteria in natural language without training custom models. The evaluator derives 2-3 weighted principles tailored to each trajectory, scores against those principles, and reconciles multiple judge opinions through the ensemble flow.

How it works: the focus_prompt field in adaptive_teaching.reward accepts arbitrary evaluation criteria. The judge reads that prompt, generates domain-relevant principles (e.g., "Correctness: 0.5 weight", "Safety: 0.3 weight"), evaluates the trajectory, and extracts behavioral patterns to store as learning memory.
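As an illustration, a focus_prompt might read as follows (expressed here as a Python dict that mirrors the assumed adaptive_teaching.reward structure; the exact YAML schema may differ):
# Hypothetical config fragment, shown as a dict for illustration only.
adaptive_teaching = {
    "reward": {
        "focus_prompt": (
            "Prioritize factual correctness and safe tool use; "
            "penalize unnecessary retries or speculative claims."
        )
    }
}
# From a prompt like this, the judge might derive principles such as
# "Correctness: 0.5", "Safety: 0.3", "Efficiency: 0.2" before scoring.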

Configuration Essentials

The reward system is configured via YAML, but you only need to understand a few key settings:

Core Settings

# configs/rim_config.yaml
rim:
  # Diversity: More temperatures = more diverse initial opinions
  temperatures: [0.2, 0.5, 0.8]

  # Escalation sensitivity
  variance_threshold: 0.15  # Lower = more escalations to expert

  # Which dimensions to evaluate
  active_judges:
    accuracy: true
    helpfulness: true
    process: true
    diagnostic: true

Key Tuning Knobs

Want more precision?
  • Lower variance_threshold to 0.10 → More cases go to the expert model
Need faster/cheaper evaluation?
  • Raise variance_threshold to 0.20 → Trust the initial team more often
  • Reduce temperatures to [0.3, 0.7] → Fewer ensemble members
Different use cases?
  • Adjust variance_threshold and uncertainty_threshold to control escalation frequency (see the sketch after this list)
  • Use focus_prompt to steer evaluation criteria toward specific domain objectives
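A minimal sketch of applying these overrides programmatically, assuming the keys live under the rim block shown above and that RIMReward accepts a config_path as in the training examples below (the derived file name is illustrative):
import yaml

from RIM.reward_adapter import RIMReward

with open("configs/rim_config.yaml") as f:
    cfg = yaml.safe_load(f)

# Tighten escalation (more cases go to the Tier 2 arbiter) and shrink the ensemble.
cfg["rim"]["variance_threshold"] = 0.10
cfg["rim"]["temperatures"] = [0.3, 0.7]

with open("configs/rim_config_strict.yaml", "w") as f:
    yaml.safe_dump(cfg, f)

reward = RIMReward(config_path="configs/rim_config_strict.yaml")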

Reward System in the Atlas SDK

The SDK runtime uses the same reward philosophy to control its execution loop. The rim block in configs/examples/openai_agent.yaml wires up the ensemble judge model and the escalation arbiter:
# configs/examples/openai_agent.yaml
rim:
  small_model:
    provider: google
    model: gemini/gemini-2.5-flash
    api_key_env: GEMINI_API_KEY
    max_output_tokens: 8096
  large_model:
    provider: google
    model: gemini/gemini-2.5-flash
    api_key_env: GEMINI_API_KEY
    max_output_tokens: 8096
  judge_prompt: 'reward the agent for attending to the issues mentioned in the task'
  variance_threshold: 0.15
  uncertainty_threshold: 0.3
During orchestration, this configuration tells the runtime how to behave:
  1. After execution completes, the session trajectory is evaluated using the small model at multiple temperatures for diverse perspectives.
  2. If variance across samples exceeds 0.15 or any sample reports uncertainty > 0.3, the system escalates to the large model arbiter.
  3. The final reward includes derived principles, score, rationale, and extracted learning patterns (student_learning, teacher_learning).
  4. The reward informs retry decisions and learning memory—patterns are stored for future sessions.
Want stricter quality control? Lower variance_threshold to increase arbiter usage. See the SDK Configuration Reference for complete syntax.
This mirrors the training world: the runtime uses rewards to keep the agent on track, while the training process uses the same signals to improve the underlying models.
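A minimal sketch of how a runtime loop might act on the reward payload (shaped like the session-level evaluation above); the score floor and function name are illustrative, not part of the SDK API:
def act_on_reward(evaluation: dict, memory: list, score_floor: float = 0.6) -> bool:
    # Persist the extracted behavioral pattern for future sessions.
    if evaluation.get("student_learning"):
        memory.append(evaluation["student_learning"])
    # Signal a retry when the session scored below the floor.
    return evaluation["score"] < score_floor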

Using the Reward System

In Training (Offline RL)

The reward system integrates seamlessly with the GRPO trainer:
from trainers.grpo import GRPOTrainer
from RIM.reward_adapter import RIMReward
from datasets import load_dataset

# 1. Instantiate reward system
reward_system = RIMReward(config_path='configs/rim_config.yaml')

# 2. Pass to trainer
trainer = GRPOTrainer(
    model="path/to/your/teacher_model",
    args=grpo_config,
    reward_funcs=[reward_system],  # Just pass it in
    train_dataset=train_dataset
)

# 3. Train - the reward system runs automatically
trainer.train()
The trainer handles calling the reward system with batches of data during the RL loop. You don’t need to manage it manually.

For Ad-hoc Evaluation

Quick evaluation of teaching effectiveness:
from RIM.reward_adapter import RIMReward

# Create reward system
reward = RIMReward(config_path='configs/rim_config.yaml')

# Evaluate a single interaction
result = reward.evaluate({
    'question': 'What is 2+2?',
    'baseline_response': 'It is 4',
    'taught_response': 'The answer is 4 because 2 plus 2 equals 4',
    'teaching': 'Explain your reasoning step by step'
})

print(f"Accuracy: {result['accuracy']}")
print(f"Helpfulness: {result['helpfulness']}")
print(f"Improvement: {result['helpfulness'] - result['baseline_accuracy']}")

In Continual Learning

In the SDK runtime, the same reward signals drive continual learning loops and help teams decide when to export traces for GRPO training. See the atlas-sdk documentation for details on wiring reward feedback into production orchestration.

Customizing Judges

Modifying Existing Judges

Judge behavior is controlled by their prompts in RIM/judges.py. To change what AccuracyJudge prioritizes:
# RIM/judges.py
class AccuracyJudge:
    def _build_prompt(self, inputs: Dict[str, Any]) -> str:
        # Customize this string to change evaluation criteria
        return f"""Evaluate these responses.

Prompt: {inputs.get('prompt', '')}
Response A: {inputs.get('response_a', '')}
Response B: {inputs.get('response_b', '')}

Step 1: Generate 2-3 evaluation principles with weights (must sum to 1.0)
Step 2: Score both responses against each principle
Step 3: Provide final scores (0.0 to 1.0)

Output JSON only: {{"principles": [...], "score_a": float, "score_b": float, "uncertainty": float}}"""

Adding a New Judge

Step 1: Create judge class (RIM/judges.py):
# Assumes `import json` and `from typing import Dict, Any` at the top of RIM/judges.py.
class CreativityJudge:
    def __init__(self):
        self.name = 'creativity'

    def evaluate(self, inputs: Dict[str, Any], model_fn, temperature: float):
        prompt = f"""Score creativity (0.0 = formulaic, 1.0 = highly creative).
        Response: {inputs.get('response', '')}
        Output JSON: {{"score": float, "rationale": str, "uncertainty": float}}"""

        response = model_fn(prompt, temperature)
        return json.loads(response)
Step 2: Register in reward adapter (RIM/reward_adapter.py):
from RIM.judges import AccuracyJudge, HelpfulnessJudge, CreativityJudge

class RIMReward:
    def __init__(self, ...):
        self.judges = {
            'accuracy': AccuracyJudge(),
            'helpfulness': HelpfulnessJudge(),
            'creativity': CreativityJudge()  # Add here
        }
Step 3: Enable in config (configs/rim_config.yaml):
active_judges:
  accuracy: true
  helpfulness: true
  creativity: true  # Enable new judge
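To sanity-check the new judge before wiring in a real model, you can call it with a stub model function (a sketch; the stub and example inputs are illustrative):
import json

from RIM.judges import CreativityJudge

def stub_model_fn(prompt: str, temperature: float) -> str:
    # Stand-in for the real model call: return a fixed JSON payload.
    return json.dumps({"score": 0.7, "rationale": "Varied phrasing and imagery", "uncertainty": 0.2})

judge = CreativityJudge()
result = judge.evaluate({"response": "The moon hung like a silver coin over the harbor."},
                        stub_model_fn, temperature=0.5)
print(result["score"], result["rationale"])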

Performance

RewardBench V2 Results

The ensemble-and-escalation architecture achieves 93.7% overall accuracy, significantly outperforming individual models:
  • Component model (gemini-2.5-flash): 77.7% on its own
  • System performance: 93.7% (+16 points)
The architecture creates a result greater than the sum of its parts.
ATLAS Reward System Leaderboard

Category Breakdown

Performance by Category
See the complete Reward System Technical Report for full analysis.

Monitoring Rewards During Training

The training logs include reward system outputs:
# Example log entry
{
  'step': 150,
  'rim_rewards': {
    'accuracy': 0.85,
    'helpfulness': 0.72,
    'process': 0.78,
    'diagnostic': 0.80
  },
  'rim_explanations': {
    'accuracy': 'Response correctly solves the problem with proper units',
    'helpfulness': 'Teaching improved reasoning structure significantly'
  },
  'escalation_rate': 0.23  # 23% of cases went to Tier 2
}
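Before reading these by hand, a small helper can flag drift automatically (entries are dicts shaped like the log example above; the bounds are illustrative):
def check_log_entry(entry: dict, min_rate: float = 0.05, max_rate: float = 0.40) -> list:
    # Collect warnings for out-of-band escalation rates and low judge rewards.
    warnings = []
    rate = entry.get("escalation_rate", 0.0)
    if not min_rate <= rate <= max_rate:
        warnings.append(f"step {entry['step']}: escalation_rate {rate:.2f} outside [{min_rate}, {max_rate}]")
    for judge, score in entry.get("rim_rewards", {}).items():
        if score < 0.5:
            warnings.append(f"step {entry['step']}: low {judge} reward ({score:.2f})")
    return warnings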
Monitor these to:
  • Spot prompt regressions (dropping helpfulness scores)
  • Identify misconfigured thresholds (escalation rate too high/low)
  • Validate teaching improvements (rising scores over time)
