The Two-Tier System

Tier 1: Fast ensemble evaluation → Tier 2: Expert arbiter when needed
How It Works
Tier 1: The Initial Team
- Multiple efficient models (like gemini-2.5-flash) run in parallel
- Each runs at a different temperature for diverse perspectives
- They all score the same interaction independently
- Fast and cheap for most cases
Tier 2: The Expert Arbiter
- Only called when the team disagrees (high variance in scores)
- Or when any judge reports low confidence
- A more powerful model (like gemini-2.5-pro) reviews everything
- Makes the final decision with full context
When Escalation Happens
The system escalates to Tier 2 when either condition holds:
- High disagreement: the standard deviation of scores exceeds the threshold (default: 0.15)
- Low confidence: any judge reports high uncertainty (default: > 0.3)
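The escalation rule itself is simple. Below is a minimal sketch, assuming each Tier 1 judge returns a score and an uncertainty estimate; the function and variable names are illustrative, not the actual RIM API.

```python
import statistics

# Defaults mirror the documented thresholds.
VARIANCE_THRESHOLD = 0.15      # max allowed std. dev. across Tier 1 scores
UNCERTAINTY_THRESHOLD = 0.3    # max allowed per-judge uncertainty

def should_escalate(scores: list[float], uncertainties: list[float]) -> bool:
    """Return True when the Tier 2 arbiter should review the interaction."""
    high_disagreement = statistics.stdev(scores) > VARIANCE_THRESHOLD
    low_confidence = any(u > UNCERTAINTY_THRESHOLD for u in uncertainties)
    return high_disagreement or low_confidence

# Example: three Tier 1 judges disagree enough to trigger escalation.
print(should_escalate([0.9, 0.55, 0.7], [0.1, 0.2, 0.15]))  # True
```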
Session-Level Evaluation
The reward system evaluates the complete trajectory after execution finishes. It derives 2-3 weighted principles tailored to the specific session, scores the trajectory against those principles, and extracts behavioral patterns for future learning.
How It Works
The evaluator receives the full session context (task, plan, steps, final answer, execution mode) and generates a structured evaluation:
- Principles: Domain-relevant evaluation criteria with weights (summing to 1.0)
- Score: Aggregated result in [0.0, 1.0] range
- Rationale: Explanation grounded in the principles
- Student learning: Cross-domain behavioral pattern to remember (not task-specific content)
- Teacher learning: Pedagogical strategy that worked (when teacher provided guidance)
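As a concrete illustration, a single evaluation might serialize to the structure below; the field names follow the list above, but the exact schema is assumed rather than taken from the code.

```python
# Illustrative only: the schema follows the fields described above.
evaluation = {
    "principles": [                      # weights sum to 1.0
        {"name": "Correctness", "weight": 0.5},
        {"name": "Safety", "weight": 0.3},
        {"name": "Efficiency", "weight": 0.2},
    ],
    "score": 0.82,                       # aggregated result in [0.0, 1.0]
    "rationale": "Final answer is correct; one redundant step reduced efficiency.",
    "student_learning": "Verify intermediate results before committing to a final answer.",
    "teacher_learning": "Short, targeted hints worked better than full worked solutions.",
}
```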
Defining Domain Objectives
The judge prompt system lets you express quality criteria in natural language without training custom models. The evaluator derives 2-3 weighted principles tailored to each trajectory, scores against those principles, and reconciles multiple judge opinions through the ensemble flow. How it works: the focus_prompt field in adaptive_teaching.reward accepts arbitrary evaluation criteria. The judge reads that prompt, generates domain-relevant principles (e.g., "Correctness: 0.5 weight", "Safety: 0.3 weight"), evaluates the trajectory, and extracts behavioral patterns to store as learning memory.
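A hedged sketch of what that looks like in YAML; the prompt text is illustrative, and the nesting follows the adaptive_teaching.reward path named above.

```yaml
adaptive_teaching:
  reward:
    focus_prompt: >
      Prioritize factual correctness and safe tool usage. Penalize responses
      that skip verification steps or fabricate citations.
```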
Configuration Essentials
The reward system is configured via YAML, but you only need to understand a few key settings:
Core Settings
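A minimal configuration sketch using the threshold names and defaults cited on this page; the model fields and key nesting are assumptions, so treat configs/rim_config.yaml in the repo as the canonical schema.

```yaml
rim:
  small_model: gemini-2.5-flash    # Tier 1 ensemble judges
  large_model: gemini-2.5-pro      # Tier 2 expert arbiter
  temperatures: [0.3, 0.5, 0.7]    # one Tier 1 sample per temperature (illustrative)
  variance_threshold: 0.15         # escalate when score std. dev. exceeds this
  uncertainty_threshold: 0.3       # escalate when any judge's uncertainty exceeds this
```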
Key Tuning Knobs
Want more precision?
- Lower variance_threshold to 0.10 → More cases go to the expert model
Want faster, cheaper evaluation?
- Raise variance_threshold to 0.20 → Trust the initial team more often
- Reduce temperatures to [0.3, 0.7] → Fewer ensemble members
In general:
- Adjust variance_threshold and uncertainty_threshold to control escalation frequency
- Use focus_prompt to steer evaluation criteria toward specific domain objectives
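For example, a precision-leaning override might look like this (same key names as above; the enclosing rim block is an assumption about the config layout):

```yaml
rim:
  variance_threshold: 0.10     # escalate more often → more Tier 2 reviews
  uncertainty_threshold: 0.3   # keep the default confidence gate
```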
Reward System in the Atlas SDK
The SDK runtime uses the same reward philosophy to control its execution loop. The rim block in configs/examples/openai_agent.yaml wires up the scorekeepers and escalation model:
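The block typically mirrors the core settings above; here is a hedged excerpt with assumed field names, so defer to the file itself for the exact schema.

```yaml
# configs/examples/openai_agent.yaml (illustrative excerpt)
rim:
  small_model: gemini-2.5-flash
  large_model: gemini-2.5-pro
  variance_threshold: 0.15
  uncertainty_threshold: 0.3
```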
- After execution completes, the session trajectory is evaluated using the small model at multiple temperatures for diverse perspectives.
- If variance across samples exceeds 0.15 or any sample reports uncertainty > 0.3, the system escalates to the large model arbiter.
- The final reward includes derived principles, score, rationale, and extracted learning patterns (student_learning, teacher_learning).
- The reward informs retry decisions and learning memory—patterns are stored for future sessions.
Want stricter quality control? Lower variance_threshold to increase arbiter usage. See the SDK Configuration Reference for complete syntax.
Using the Reward System
In Training (Offline RL)
The reward system integrates with the GRPO trainer:
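Conceptually, the wiring reduces to exposing the reward as a per-completion scoring function. The import path points at RIM/reward_adapter.py (referenced later on this page), but the class name, constructor, and evaluate signature below are assumptions rather than the verified API.

```python
# Sketch only: names are placeholders for the adapter in RIM/reward_adapter.py
# and the GRPO trainer's reward hook.
from RIM.reward_adapter import RIMReward                      # assumed class name

reward_fn = RIMReward(config_path="configs/rim_config.yaml")  # assumed signature

def rim_reward(prompts, completions, **kwargs):
    """Return one scalar reward per sampled completion for the GRPO trainer."""
    return [reward_fn.evaluate(p, c).score for p, c in zip(prompts, completions)]

# e.g. GRPOTrainer(..., reward_funcs=[rim_reward])  # hypothetical trainer hook
```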
For Ad-hoc Evaluation
Quick evaluation of teaching effectiveness:
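A minimal sketch using the same assumed adapter; the method names and return fields are illustrative.

```python
# Sketch only: evaluate one interaction and inspect the structured result.
from RIM.reward_adapter import RIMReward                      # assumed class name

reward_fn = RIMReward(config_path="configs/rim_config.yaml")  # assumed signature
result = reward_fn.evaluate(
    prompt="Explain why the unit tests fail after the refactor.",
    response="The fixture still patches the old module path, so imports resolve stale code.",
)
print(result.score)      # aggregated score in [0.0, 1.0]
print(result.rationale)  # explanation grounded in the derived principles
```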
In Continual Learning
In the SDK runtime, the same reward signals drive continual learning loops and help teams decide when to export traces for GRPO training. See the atlas-sdk documentation for details on wiring reward feedback into production orchestration.
Customizing Judges
Modifying Existing Judges
Judge behavior is controlled by their prompts in RIM/judges.py. To change what AccuracyJudge prioritizes:
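A hedged illustration of what such a change might look like, assuming AccuracyJudge keeps its instructions in a class-level prompt string; the real attribute and base-class names may differ.

```python
# RIM/judges.py (illustrative excerpt; attribute and base-class names are assumed)
class AccuracyJudge(Judge):
    prompt = """
    Evaluate the response for factual accuracy.
    Prioritize: (1) claims supported by the provided context,
    (2) correct numerical reasoning, (3) no fabricated citations.
    Return a score in [0.0, 1.0], a brief rationale, and an uncertainty estimate.
    """
```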
Adding a New Judge
Step 1: Create judge class (RIM/judges.py):
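A minimal sketch, assuming new judges subclass the same base and expose a name plus a prompt; the interface is an assumption, not the verified one in RIM/judges.py.

```python
# RIM/judges.py (illustrative sketch; the base-class interface is assumed)
class ConcisenessJudge(Judge):
    name = "conciseness"
    prompt = """
    Score how concise the response is while preserving required information.
    Penalize filler, repetition, and unnecessary hedging.
    Return a score in [0.0, 1.0], a short rationale, and an uncertainty estimate.
    """
```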
Step 2: Register the judge in the reward adapter (RIM/reward_adapter.py):
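Registration is likewise a sketch; the adapter's actual registry mechanism may differ.

```python
# RIM/reward_adapter.py (illustrative; JUDGE_REGISTRY is a hypothetical registry)
from RIM.judges import ConcisenessJudge

JUDGE_REGISTRY["conciseness"] = ConcisenessJudge
```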
Step 3: Enable it in the config (configs/rim_config.yaml):
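Finally, a hedged example of enabling the judge in the config; the key name and sibling entries are assumptions.

```yaml
# configs/rim_config.yaml (illustrative excerpt)
judges:
  - accuracy
  - helpfulness
  - conciseness   # the new judge registered above
```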
Performance
RewardBench V2 Results
The ensemble-and-escalation architecture achieves 93.7% overall accuracy, significantly outperforming individual models:
- Component model (gemini-2.5-flash): 77.7% on its own
- System performance: 93.7% (+16 points)

Category Breakdown

Monitoring Rewards During Training
The training logs include reward system outputs. Use them to:
- Spot prompt regressions (dropping helpfulness scores)
- Identify misconfigured thresholds (escalation rate too high/low)
- Validate teaching improvements (rising scores over time)
Next Steps
Adaptive Dual-Agent Reasoning
See how the dual-agent workflow and lane logic operate
GRPO Training
Use the reward system to train teacher models
SDK Runtime
See how rewards flow through the production loop
References
- Reward System Technical Report - Complete methodology and benchmarks
- ATLAS Technical Report - How rewards integrate with training
- RewardBench V2 - Benchmark leaderboard