Core Concept
- Student (your agent) – Any LLM or tool stack executing the task (GPT, Claude, Gemini, local checkpoints, custom code).
- Verifying teacher – An 8B specialized model trained to diagnose gaps, inject guidance, and certify answers.
- Outcome – Better answers, higher safety, and richer telemetry—without retraining the underlying agent.
Why it’s model agnostic
| Traditional approach | Adaptive dual-agent reasoning |
|---|---|
| Retrain or fine-tune the agent | Keep the agent frozen and add a verifying teacher |
| Requires model access & GPUs | Works with API-only models |
| Risk of regression | Preserves baseline capabilities |
| Weeks to deploy updates | Hours to roll out new teacher checkpoints |
See Student & Verifying Teacher Roles for the runtime personas and Offline Training to learn how new teachers are trained.
Why it works
1. Asymmetric specialization
The teacher focuses solely on teaching. It doesn’t need to outperform the agent at solving tasks; it needs to spot blind spots, orchestrate retries, and provide precise interventions.
Analogy: A senior reviewer doesn’t code faster than the whole team—they prevent critical mistakes, guide architecture decisions, and approve releases.
2. Inference-time enhancement
Guidance happens through prompts at runtime (a minimal sketch follows this list):
- The teacher inspects the task, telemetry, and prior attempts.
- It scores capability, triages risk, and drafts guidance.
- The guidance is merged into the agent’s context.
- The agent re-runs with the added teaching and produces the final answer.
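A minimal Python sketch of that loop, assuming duck-typed `student` and `teacher` clients; the method names `complete`, `review`, and `certify` are illustrative, not the SDK’s actual API:

```python
from dataclasses import dataclass


@dataclass
class Guidance:
    confidence: float  # teacher's estimate that the student can solve the task unaided
    notes: str         # targeted teaching to merge into the student's context


def dual_agent_answer(task: str, student, teacher) -> str:
    """Illustrative runtime loop: attempt, diagnose, teach, re-run, certify."""
    draft = student.complete(task)                       # student's first attempt
    guidance = teacher.review(task=task, attempt=draft)  # teacher diagnoses gaps, drafts guidance
    if guidance.confidence >= 0.9:
        teacher.certify(task=task, answer=draft)         # high confidence: certify the draft as-is
        return draft
    taught_prompt = f"{guidance.notes}\n\n{task}"        # merge teaching into the agent's context
    final = student.complete(taught_prompt)              # agent re-runs with the added teaching
    teacher.certify(task=task, answer=final)             # teacher certifies the final answer
    return final
```

The student never changes: only its prompt does, which is what keeps the approach model agnostic.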
3. Adaptive intensity
The teacher adjusts effort based on confidence (see the routing sketch below):
- High confidence: Light-touch verification or a single checklist.
- Medium confidence: Paired review of the final result.
- Low confidence: Step-by-step coaching with retries.
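One way to picture the routing; the lane names match the SDK lanes mentioned under Integration patterns, but the thresholds here are illustrative, not tuned values:

```python
def select_lane(confidence: float) -> str:
    """Map the teacher's confidence in the student to a teaching intensity (illustrative thresholds)."""
    if confidence >= 0.85:
        return "auto"    # light-touch verification or a single checklist
    if confidence >= 0.50:
        return "paired"  # paired review of the final result
    return "coach"       # step-by-step coaching with retries
```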
Deployment requirements
| Component | Specification | Purpose |
|---|---|---|
| Verifying teacher | 8B RL-trained model | Generates adaptive guidance & certifications |
| Student agent | Any size / provider | Executes the actual work |
| Context window | 4k–32k tokens | Accommodates guidance + agent output |
| Latency overhead | ~30% | Extra pass for analysis and teaching |
Performance snapshot (τ²-bench, mms_issue subset)
| System | Pass@1 | Notes |
|---|---|---|
| ATLAS dual-agent | 24.0% | Minimal degradation across retries |
| GPT-4.1 | 18.0% | −8 pts from Pass@1 to Pass@4 |
| Claude 3.7 Sonnet | 18.0% | −16 pts drop across retries |
| o4-mini | 12.0% | −10 pts drop across retries |
| Qwen3-8B (student only) | 4.1% | No teacher guidance |
- ~6× lift on the same student (4.1% → 24.0% Pass@1) by adding the verifying teacher.
- Stable retries: The teacher keeps success rates high on subsequent attempts.
- Cross-domain transfer: A math-trained teacher can supervise telecom debugging tasks because it enforces process, not domain answers.
Aggregate impact
- Average accuracy gain (runtime + GRPO): +15.7%
- Maximum domain lift: +29.6%
- Non-degradation rate: 97%
- Token efficiency: ~50% reduction
- Completion rate: +31%
Benefits vs. other approaches
Fine-tuning / RLHF
- No retraining or weight access required.
- Zero risk of catastrophic forgetting.
- Deploy new guidance in hours, not weeks.
Prompt engineering
- Adaptive, not one-shot tuning.
- Systematic and measurable improvements.
- Token use scales with confidence (fast lanes stay cheap).
Ensembles
- Single agent executes the work; no multi-model voting.
- Lower cost and latency for comparable quality.
- Guidance is inspectable and auditable.
Training the verifying teacher
- Supervised warmup (SFT) – Teach baseline review behaviors (4–6 hours on 8× H100).
- GRPO fine-tuning – Optimize for student improvement and calibrated confidence (24–36 hours on 8× H100).
Integration patterns
SDK runtime orchestration
Integrate via the SDK to wrap your agent with adaptive lanes (auto, paired, coach, escalate). The runtime logs every teaching decision and reward signal for later analysis.
Offline GRPO training
Export runtime traces (`arc-atlas --output traces.jsonl`) and run `python scripts/run_offline_pipeline.py --export-path <traces>.jsonl` to produce updated teacher checkpoints. Point the SDK back at the new weights to close the loop.
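A sketch of the closed loop as a script; the two commands come from the steps above, while the checkpoint path is a placeholder that depends on your deployment:

```python
import subprocess

# 1. Export runtime traces collected by the SDK.
subprocess.run(["arc-atlas", "--output", "traces.jsonl"], check=True)

# 2. Run the offline GRPO pipeline on the exported traces to produce new teacher checkpoints.
subprocess.run(
    ["python", "scripts/run_offline_pipeline.py", "--export-path", "traces.jsonl"],
    check=True,
)

# 3. Point the SDK runtime at the new teacher weights (placeholder path; set this wherever
#    your deployment reads the teacher checkpoint from).
NEW_TEACHER_CHECKPOINT = "checkpoints/teacher-grpo-latest"
```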
Best practices
Selecting the teacher checkpoint
- ATLAS-8B-Thinking: Analytical, math-heavy domains
- ATLAS-8B-Instruct: Code generation, structured workflows
- Custom GRPO: Train on your exported traces for domain-specific oversight
Qualifying student agents
Confirm the agent supports:
- System prompts or instruction conditioning
- Sufficient context window (>4k tokens)
- Deterministic decoding (control temperature/top_p)
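For example, with an OpenAI-compatible chat endpoint, the first and third requirements translate into a call like the following sketch (the model name is a placeholder, and the system message is simply where teacher guidance would be merged in at runtime):

```python
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint works via base_url / api_key

response = client.chat.completions.create(
    model="your-student-model",  # placeholder: any student that accepts system prompts
    messages=[
        {"role": "system", "content": "<teacher guidance is merged in here at runtime>"},
        {"role": "user", "content": "<original task text>"},
    ],
    temperature=0,  # deterministic decoding
    top_p=1,
)
print(response.choices[0].message.content)
```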
Operational tips
- Cache teacher guidance for repetitive tickets
- Batch similar tasks for throughput
- Stream intermediate verdicts for human-on-the-loop monitoring
- Track token budgets per lane to manage cost
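A minimal sketch of the first tip above, caching guidance keyed on a normalized task hash; it is in-memory only, and the `teacher.review` call is illustrative:

```python
import hashlib


def _task_key(task: str) -> str:
    """Hash a normalized task so near-identical tickets share one cache entry."""
    return hashlib.sha256(task.strip().lower().encode("utf-8")).hexdigest()


_guidance_cache: dict[str, str] = {}


def get_guidance(task: str, teacher) -> str:
    """Return cached guidance for repeat tickets; call the teacher only on a cache miss."""
    key = _task_key(task)
    if key not in _guidance_cache:
        _guidance_cache[key] = teacher.review(task)  # illustrative teacher call
    return _guidance_cache[key]
```

A shared store (for example Redis) would replace the in-memory dict in production so cached guidance survives restarts and is visible across workers.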
Next steps
Runtime Orchestration
Dive into the lane logic, telemetry, and orchestration flow.
Deploy to Production
Learn how to operate the dual-agent loop in production infrastructure.
Reward System
Understand how guidance quality is quantified.
Performance Benchmarks
Explore τ²-bench results in detail.
Offline Training
Run SFT + GRPO to evolve your verifying teacher.
References
- ATLAS Technical Report — Architecture and evaluations
- Offline Training Guide — Hands-on teacher training walkthrough
- SRE Case Study — Applying the loop to incident management