Core Concept
- Student (your agent) – Any LLM or tool stack executing the task (GPT, Claude, Gemini, local checkpoints, custom code).
- Verifying teacher – An 8B specialized model trained to diagnose gaps, inject guidance, and certify answers.
- Outcome – Better answers, higher safety, and richer telemetry—without retraining the underlying agent.
Why it’s model agnostic
| Traditional approach | Adaptive dual-agent reasoning |
|---|---|
| Retrain or fine-tune the agent | Keep the agent frozen and add a verifying teacher |
| Requires model access & GPUs | Works with API-only models |
| Risk of regression | Preserves baseline capabilities |
| Weeks to deploy updates | Hours to roll out new teacher checkpoints |
See Student & Verifying Teacher Roles for the runtime personas and Offline Training to learn how new teachers are trained.
Why it works
1. Asymmetric specialization
The teacher focuses solely on teaching. It doesn’t need to outperform the agent at solving tasks; it needs to spot blind spots, orchestrate retries, and provide precise interventions.
Analogy: A senior reviewer doesn’t code faster than the whole team—they prevent critical mistakes, guide architecture decisions, and approve releases.
2. Inference-time enhancement
Guidance happens through prompts at runtime (a minimal sketch follows this list):
- The teacher inspects the task, telemetry, and prior attempts.
- It scores capability, triages risk, and drafts guidance.
- The guidance is merged into the agent’s context.
- The agent re-runs with the added teaching and produces the final answer.
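A minimal Python sketch of that loop, assuming duck-typed `student` and `teacher` clients; the method names `complete`, `review`, and `certify` are illustrative, not the SDK’s actual API:

```python
from dataclasses import dataclass


@dataclass
class Guidance:
    confidence: float  # teacher's estimate that the student can solve the task unaided
    notes: str         # targeted teaching to merge into the student's context


def dual_agent_answer(task: str, student, teacher) -> str:
    """Illustrative runtime loop: attempt, diagnose, teach, re-run, certify."""
    draft = student.complete(task)                       # student's first attempt
    guidance = teacher.review(task=task, attempt=draft)  # teacher diagnoses gaps, drafts guidance
    if guidance.confidence >= 0.9:
        teacher.certify(task=task, answer=draft)         # high confidence: certify the draft as-is
        return draft
    taught_prompt = f"{guidance.notes}\n\n{task}"        # merge teaching into the agent's context
    final = student.complete(taught_prompt)              # agent re-runs with the added teaching
    teacher.certify(task=task, answer=final)             # teacher certifies the final answer
    return final
```

The student never changes: only its prompt does, which is what keeps the approach model agnostic.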
3. Adaptive intensity
The teacher adjusts effort based on confidence (see the routing sketch below):
- High confidence: Light-touch verification or a single checklist.
- Medium confidence: Paired review of the final result.
- Low confidence: Step-by-step coaching with retries.
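One way to picture the routing; the lane names match the SDK lanes mentioned under Integration patterns, but the thresholds here are illustrative, not tuned values:

```python
def select_lane(confidence: float) -> str:
    """Map the teacher's confidence in the student to a teaching intensity (illustrative thresholds)."""
    if confidence >= 0.85:
        return "auto"    # light-touch verification or a single checklist
    if confidence >= 0.50:
        return "paired"  # paired review of the final result
    return "coach"       # step-by-step coaching with retries
```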
Deployment requirements
| Component | Specification | Purpose |
|---|---|---|
| Verifying teacher | 8B RL-trained model | Generates adaptive guidance & certifications |
| Student agent | Any size / provider | Executes the actual work |
| Context window | 4k–32k tokens | Accommodates guidance + agent output |
| Latency overhead | ~30% | Extra pass for analysis and teaching |
Performance snapshot (τ²-bench, mms_issue subset)
| System | Pass@1 | Notes |
|---|---|---|
| ATLAS dual-agent | 24.0% | Minimal degradation across retries |
| GPT-4.1 | 18.0% | −8 pts from Pass@1 to Pass@4 |
| Claude 3.7 Sonnet | 18.0% | −16 pts drop across retries |
| o4-mini | 12.0% | −10 pts drop across retries |
| Qwen3-8B (student only) | 4.1% | No teacher guidance |
- ~6× lift on the same student (4.1% → 24.0% Pass@1) by adding the verifying teacher.
- Stable retries: The teacher keeps success rates high on subsequent attempts.
- Cross-domain transfer: A math-trained teacher can supervise telecom debugging tasks because it enforces process, not domain answers.
Aggregate impact
- Average accuracy gain (runtime + GRPO): +15.7%
- Maximum domain lift: +29.6%
- Non-degradation rate: 97%
- Token efficiency: ~50% reduction
- Completion rate: +31%
Benefits vs. other approaches
Fine-tuning / RLHF
- No retraining or weight access required.
- Zero risk of catastrophic forgetting.
- Deploy new guidance in hours, not weeks.
Prompt engineering
- Adaptive, not one-shot tuning.
- Systematic and measurable improvements.
- Token use scales with confidence (fast lanes stay cheap).
Ensembles
- Single agent executes the work; no multi-model voting.
- Lower cost and latency for comparable quality.
- Guidance is inspectable and auditable.
Training the verifying teacher
- Supervised warmup (SFT) – Teach baseline review behaviors (4–6 hours on 8× H100).
- GRPO fine-tuning – Optimize for student improvement and calibrated confidence (24–36 hours on 8× H100).
Integration patterns
SDK runtime orchestration
Integrate via the SDK to wrap your agent with adaptive lanes (auto, paired, coach, escalate). The runtime logs every teaching decision and reward signal for later analysis.
Offline GRPO training
Export runtime traces (`arc-atlas --output traces.jsonl`) and run `python scripts/run_offline_pipeline.py --export-path <traces>.jsonl` to produce updated teacher checkpoints. Point the SDK back at the new weights to close the loop.
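A sketch of the closed loop as a script; the two commands come from the steps above, while the checkpoint path is a placeholder that depends on your deployment:

```python
import subprocess

# 1. Export runtime traces collected by the SDK.
subprocess.run(["arc-atlas", "--output", "traces.jsonl"], check=True)

# 2. Run the offline GRPO pipeline on the exported traces to produce new teacher checkpoints.
subprocess.run(
    ["python", "scripts/run_offline_pipeline.py", "--export-path", "traces.jsonl"],
    check=True,
)

# 3. Point the SDK runtime at the new teacher weights (placeholder path; set this wherever
#    your deployment reads the teacher checkpoint from).
NEW_TEACHER_CHECKPOINT = "checkpoints/teacher-grpo-latest"
```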
Best practices
Selecting the teacher checkpoint
- ATLAS-8B-Thinking: Analytical, math-heavy domains
- ATLAS-8B-Instruct: Code generation, structured workflows
- Custom GRPO: Train on your exported traces for domain-specific oversight
Qualifying student agents
Confirm the agent supports:
- System prompts or instruction conditioning
- Sufficient context window (>4k tokens)
- Deterministic decoding (control temperature/top_p)
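For example, with an OpenAI-compatible chat endpoint, the first and third requirements translate into a call like the following sketch (the model name is a placeholder, and the system message is simply where teacher guidance would be merged in at runtime):

```python
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint works via base_url / api_key

response = client.chat.completions.create(
    model="your-student-model",  # placeholder: any student that accepts system prompts
    messages=[
        {"role": "system", "content": "<teacher guidance is merged in here at runtime>"},
        {"role": "user", "content": "<original task text>"},
    ],
    temperature=0,  # deterministic decoding
    top_p=1,
)
print(response.choices[0].message.content)
```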
Operational tips
- Cache teacher guidance for repetitive tickets
- Batch similar tasks for throughput
- Stream intermediate verdicts for human-on-the-loop monitoring
- Track token budgets per lane to manage cost
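A minimal sketch of the first tip above, caching guidance keyed on a normalized task hash; it is in-memory only, and the `teacher.review` call is illustrative:

```python
import hashlib


def _task_key(task: str) -> str:
    """Hash a normalized task so near-identical tickets share one cache entry."""
    return hashlib.sha256(task.strip().lower().encode("utf-8")).hexdigest()


_guidance_cache: dict[str, str] = {}


def get_guidance(task: str, teacher) -> str:
    """Return cached guidance for repeat tickets; call the teacher only on a cache miss."""
    key = _task_key(task)
    if key not in _guidance_cache:
        _guidance_cache[key] = teacher.review(task)  # illustrative teacher call
    return _guidance_cache[key]
```

A shared store (for example Redis) would replace the in-memory dict in production so cached guidance survives restarts and is visible across workers.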
Next steps
Runtime Orchestration
Dive into the lane logic, telemetry, and orchestration flow.
Deploy to Production
Learn how to operate the dual-agent loop in production infrastructure.
Reward System
Understand how guidance quality is quantified.
Performance Benchmarks
Explore τ²-bench results in detail.
Offline Training
Run SFT + GRPO to evolve your verifying teacher.
References
- ATLAS Technical Report — Architecture and evaluations
- Offline Training Guide — Hands-on teacher training walkthrough
- SRE Case Study — Applying the loop to incident management