How do we know if teaching actually worked? ATLAS uses a team of AI judges to score every interaction. Instead of a single reward model that can be biased or brittle, ATLAS uses a multi-agent ensemble. Think of it like a medical panel: a team of general practitioners makes an initial diagnosis, and when they disagree, a specialist makes the final call. This achieves 93.7% accuracy on RewardBench V2 while keeping costs low.

The Two-Tier System

Figure: ATLAS reward system architecture. Tier 1 runs fast ensemble evaluation; Tier 2 escalates to an expert arbiter when needed.

How It Works

Tier 1: The Initial Team
  • Multiple efficient models (like gemini-2.5-flash) run in parallel
  • Each is sampled at a different temperature to provide diverse perspectives
  • They all score the same interaction independently
  • Fast and cheap for most cases
Tier 2: The Expert Arbiter
  • Only called when the team disagrees (high variance in scores)
  • Or when any judge reports low confidence
  • A more powerful model (like gemini-2.5-pro) reviews everything
  • Makes the final decision with full context
The key insight: Most cases are clear-cut and don’t need the expensive expert. When there’s genuine ambiguity, escalate to the specialist.
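As a rough sketch of Tier 1, assuming a placeholder score_once helper standing in for a real judge-model call (the names and return shape are illustrative, not the ATLAS API):

from concurrent.futures import ThreadPoolExecutor

def score_once(interaction: str, temperature: float) -> tuple[float, float]:
    """Placeholder for one judge call (e.g. gemini-2.5-flash at the given
    temperature). Replace with a real model call; returns (score, uncertainty)."""
    return 0.8, 0.1  # dummy values so the sketch runs end to end

def tier1_scores(interaction: str, temperatures=(0.2, 0.5, 0.8)):
    """Run the fast judges in parallel, one sample per temperature."""
    with ThreadPoolExecutor(max_workers=len(temperatures)) as pool:
        results = list(pool.map(lambda t: score_once(interaction, t), temperatures))
    scores = [score for score, _ in results]
    uncertainties = [unc for _, unc in results]
    return scores, uncertainties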

When Escalation Happens

The system escalates to Tier 2 when either:
  • High disagreement: Standard deviation of scores exceeds the threshold (default: 0.15)
  • Low confidence: Any judge reports uncertainty above the threshold (default: 0.3)
Otherwise, it uses the most confident judgment from Tier 1—saving both time and money.
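The escalation rule itself fits in a few lines of Python (function names are illustrative; the default thresholds are the ones quoted above):

import statistics

def needs_escalation(scores, uncertainties,
                     variance_threshold=0.15, uncertainty_threshold=0.3):
    """True when the Tier 2 arbiter should review the interaction."""
    high_disagreement = statistics.stdev(scores) > variance_threshold
    low_confidence = any(u > uncertainty_threshold for u in uncertainties)
    return high_disagreement or low_confidence

def most_confident(scores, uncertainties):
    """Tier 1 fallback: take the score from the least uncertain judge."""
    return min(zip(scores, uncertainties), key=lambda pair: pair[1])[0]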

Session-Level Evaluation

The reward system evaluates the complete trajectory after execution finishes. It derives 2-3 weighted principles tailored to the specific session, scores the trajectory against those principles, and extracts behavioral patterns for future learning.

How It Works

The evaluator receives the full session context—task, plan, steps, final answer, execution mode—and generates a structured evaluation:
{
  "principles": [
    {"name": "Correctness", "weight": 0.5, "description": "Final deliverable matches requirements"},
    {"name": "Safety", "weight": 0.3, "description": "No policy violations detected"},
    {"name": "Efficiency", "weight": 0.2, "description": "Minimal retries needed"}
  ],
  "score": 0.85,
  "rationale": "Response solves the task correctly with efficient execution",
  "uncertainty": 0.1,
  "student_learning": "For straightforward tasks, proceed directly to solution without exploratory steps",
  "teacher_learning": null
}
Key components:
  • Principles: Domain-relevant evaluation criteria with weights (sum to 1.0)
  • Score: Aggregated result in [0.0, 1.0] range
  • Rationale: Explanation grounded in the principles
  • Student learning: Cross-domain behavioral pattern to remember (not task-specific content)
  • Teacher learning: Pedagogical strategy that worked (when teacher provided guidance)
This makes every score fully auditable—you can see which principles were applied and why the judgment was made.
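For illustration, the same payload could be modeled in application code like this (a sketch mirroring the JSON fields above; these are not the SDK's actual types):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Principle:
    name: str          # e.g. "Correctness"
    weight: float      # weights across all principles should sum to 1.0
    description: str

@dataclass
class SessionEvaluation:
    principles: list[Principle]
    score: float                         # aggregated result in [0.0, 1.0]
    rationale: str
    uncertainty: float
    student_learning: Optional[str] = None
    teacher_learning: Optional[str] = None

    def weights_valid(self, tol: float = 1e-6) -> bool:
        """Sanity check: principle weights form a proper convex combination."""
        return abs(sum(p.weight for p in self.principles) - 1.0) < tol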

Defining Domain Objectives

The judge prompt system lets you express quality criteria in natural language without training custom models. The evaluator derives 2-3 weighted principles tailored to each trajectory, scores against those principles, and reconciles multiple judge opinions through the ensemble flow.

How it works: The focus_prompt field in adaptive_teaching.reward accepts arbitrary evaluation criteria. The judge reads that prompt, generates domain-relevant principles (e.g., “Correctness: 0.5 weight”, “Safety: 0.3 weight”), evaluates the trajectory, and extracts behavioral patterns to store as learning memory.
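For example, here is a domain-specific focus prompt and the kind of weighted principles a judge might derive from it (both the prompt text and the principle set are invented for illustration, not actual ATLAS output):

# Illustrative only: not actual ATLAS output.
focus_prompt = (
    "Reward solutions that cite primary sources, avoid speculation, "
    "and keep the final answer under 200 words."
)

derived_principles = [
    {"name": "Source grounding", "weight": 0.5,
     "description": "Claims are backed by cited primary sources"},
    {"name": "No speculation", "weight": 0.3,
     "description": "Avoids unsupported or fabricated claims"},
    {"name": "Brevity", "weight": 0.2,
     "description": "Final answer stays within the requested length"},
]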

Configuration Essentials

The reward system is configured via YAML, but you only need to understand a few key settings:

Core Settings

# configs/rim_config.yaml
rim:
  # Diversity: More temperatures = more diverse initial opinions
  temperatures: [0.2, 0.5, 0.8]

  # Escalation sensitivity
  variance_threshold: 0.15  # Lower = more escalations to expert
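  uncertainty_threshold: 0.3  # Any judge above this uncertainty triggers the arbiter (default per this page)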

  # Which dimensions to evaluate
  active_judges:
    accuracy: true
    helpfulness: true
    process: true
    diagnostic: true

Key Tuning Knobs

Want more precision?
  • Lower variance_threshold to 0.10 → More cases go to the expert model
Need faster/cheaper evaluation?
  • Raise variance_threshold to 0.20 → Trust the initial team more often
  • Reduce temperatures to [0.3, 0.7] → Fewer ensemble members
Different use cases?
  • Adjust variance_threshold and uncertainty_threshold to control escalation frequency
  • Use focus_prompt to steer evaluation criteria toward specific domain objectives
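If you prefer to apply these profiles programmatically, here is a small sketch using PyYAML (the tuned output filename is arbitrary and assumed, not an ATLAS convention):

import yaml  # pip install pyyaml

with open("configs/rim_config.yaml") as f:
    config = yaml.safe_load(f)

# Precision-leaning profile: escalate to the expert arbiter more often.
config["rim"]["variance_threshold"] = 0.10

# A cost-leaning profile would instead relax the threshold and trim the ensemble:
# config["rim"]["variance_threshold"] = 0.20
# config["rim"]["temperatures"] = [0.3, 0.7]

with open("configs/rim_config_tuned.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)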

Reward System in the Atlas SDK

The SDK runtime uses the same reward philosophy to control its execution loop. The rim block in configs/examples/openai_agent.yaml wires up the judge models and the escalation arbiter:
# configs/examples/openai_agent.yaml
rim:
  small_model:
    provider: google
    model: gemini/gemini-2.5-flash
    api_key_env: GEMINI_API_KEY
    max_output_tokens: 8096
  large_model:
    provider: google
    model: gemini/gemini-2.5-flash
    api_key_env: GEMINI_API_KEY
    max_output_tokens: 8096
  judge_prompt: 'reward the agent for attending the issues mentioned in the task'
  variance_threshold: 0.15
  uncertainty_threshold: 0.3
During orchestration, this configuration tells the runtime how to behave:
  1. After execution completes, the session trajectory is evaluated using the small model at multiple temperatures for diverse perspectives.
  2. If variance across samples exceeds 0.15 or any sample reports uncertainty > 0.3, the system escalates to the large model arbiter.
  3. The final reward includes derived principles, score, rationale, and extracted learning patterns (student_learning, teacher_learning).
  4. The reward informs retry decisions and learning memory—patterns are stored for future sessions.
Want stricter quality control? Lower variance_threshold to increase arbiter usage. See the SDK Configuration Reference for complete syntax.
This mirrors the training world: the runtime uses rewards to keep the agent on track, while the training process uses the same signals to improve the underlying models.
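Putting the four steps together, here is a high-level sketch of that loop (the helper callables small_judge and large_judge and the retry cutoff are hypothetical, not the SDK's actual API):

import statistics

def reward_and_decide(trajectory, small_judge, large_judge, memory,
                      variance_threshold=0.15, uncertainty_threshold=0.3,
                      retry_below=0.5):
    # 1. Tier 1: sample the small model at several temperatures.
    evals = [small_judge(trajectory, temperature=t) for t in (0.2, 0.5, 0.8)]
    scores = [e["score"] for e in evals]
    uncertainties = [e["uncertainty"] for e in evals]

    # 2. Escalate to the large-model arbiter on disagreement or low confidence.
    if (statistics.stdev(scores) > variance_threshold
            or any(u > uncertainty_threshold for u in uncertainties)):
        final = large_judge(trajectory, tier1_evals=evals)
    else:
        final = min(evals, key=lambda e: e["uncertainty"])

    # 3. Persist extracted patterns for future sessions.
    for key in ("student_learning", "teacher_learning"):
        if final.get(key):
            memory.append((key, final[key]))

    # 4. Let the score inform a retry decision (cutoff is illustrative).
    should_retry = final["score"] < retry_below
    return final, should_retry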

Implementation

For practical usage guides, see Reward System Implementation:
  • Integrate with GRPO training
  • Run ad-hoc evaluations
  • Customize judges and evaluation criteria
  • Monitor reward metrics during training
