Abstract
This case study addresses the “Judgment Gap”: the critical difference between an agent’s ability to reason and its ability to implement a solution in a collaborative, dual-control environment. By applying the ATLAS framework to the complex mms_issue tasks in the τ²-bench benchmark, we demonstrate a 24.0% pass@1 rate, a 33% relative improvement over GPT-4.1 and a new state-of-the-art for agent reliability.
24.0% Pass@1 Rate
New state-of-the-art on the τ²-bench mms_issue tasks, a benchmark for dual-control agent reliability.
6x Performance Lift
The teacher-student framework provided a nearly 6x performance improvement over the student-only baseline (4.1% pass@1).
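For reference, pass@1 is the fraction of tasks an agent solves on its single first attempt. A minimal sketch of the computation follows; the function and the task counts are illustrative assumptions, not τ²-bench’s actual harness or task split:

```python
def pass_at_1(first_attempt_passed: list[bool]) -> float:
    """Fraction of tasks whose single first attempt passed the checker.

    One boolean per task: True if the agent's first rollout
    satisfied the task's success condition.
    """
    return sum(first_attempt_passed) / len(first_attempt_passed)

# Illustrative numbers roughly matching the case study's results:
atlas = pass_at_1([True] * 24 + [False] * 76)     # 0.24
baseline = pass_at_1([True] * 4 + [False] * 96)   # 0.04
print(f"lift: {atlas / baseline:.1f}x")           # ~6x over the baseline
```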
The Challenge: The Judgment Gap
Real-world workflows are not static. They are dynamic, dual-control environments where both an AI agent and a human user can act. Standard benchmarks that test pure reasoning in isolation fail to capture the primary source of agent failure: a lack of collaborative judgment. τ²-bench models this challenge by creating scenarios where agent success depends on guiding a user through a series of actions. This moves evaluation from an agent that can do to an agent that can guide.
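To make the dual-control setting concrete, here is a toy interaction loop in which the agent can only talk while the user controls the device. The class, state fields, and troubleshooting steps are hypothetical illustrations, not τ²-bench’s actual API or task content:

```python
from dataclasses import dataclass, field

@dataclass
class DualControlEnv:
    """Toy dual-control environment (hypothetical, not the τ²-bench API).

    The agent can only send messages; only the simulated user can
    mutate device state, so success depends on correct guidance.
    """
    device_rebooted: bool = False
    apn_configured: bool = False
    transcript: list[str] = field(default_factory=list)

    def agent_says(self, message: str) -> None:
        # The agent has no direct control; it can only guide the user.
        self.transcript.append(f"agent: {message}")

    def user_acts(self, action: str) -> None:
        # The user executes actions; ordering matters, so the agent
        # must elicit the right steps in the right sequence.
        self.transcript.append(f"user: {action}")
        if action == "reboot_device":
            self.device_rebooted = True
        elif action == "set_apn" and self.device_rebooted:
            self.apn_configured = True

    def solved(self) -> bool:
        return self.device_rebooted and self.apn_configured

env = DualControlEnv()
env.agent_says("Please reboot your device first.")
env.user_acts("reboot_device")
env.agent_says("Now open settings and update the APN.")
env.user_acts("set_apn")
assert env.solved()  # guidance in the wrong order would fail here
```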
Methodology: Cross-Domain Learning
The ATLAS framework employs a Teacher-Student architecture to instill judgment:
- Teacher Model (Arc-Intelligence/arc-teacher-8b): An expert in a separate, complex domain (mathematics).
- Student Model (Qwen/Qwen3-8B): A generalist model that executes tasks.
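A minimal sketch of how a teacher-student loop of this shape can be wired together; the function names and the draft-critique-revise protocol below are illustrative assumptions, not the published ATLAS implementation:

```python
from typing import Callable

def teacher_guidance(teacher_generate: Callable[[str], str],
                     task: str, draft_plan: str) -> str:
    """Ask the teacher to critique the student's draft plan.

    `teacher_generate` wraps the teacher model (e.g. a client for
    Arc-Intelligence/arc-teacher-8b). Hypothetical protocol.
    """
    prompt = (
        f"Task: {task}\n"
        f"Student plan: {draft_plan}\n"
        "Point out judgment errors (wrong step ordering, missing user "
        "confirmations) and suggest a corrected plan."
    )
    return teacher_generate(prompt)

def solve_with_teacher(student_generate: Callable[[str], str],
                       teacher_generate: Callable[[str], str],
                       task: str) -> str:
    # 1. The student drafts a plan on its own.
    draft = student_generate(f"Plan the steps to resolve: {task}")
    # 2. The teacher, expert in a different domain, critiques the
    #    judgment rather than the domain specifics.
    critique = teacher_guidance(teacher_generate, task, draft)
    # 3. The student revises and executes with the critique in context.
    return student_generate(
        f"Task: {task}\nDraft: {draft}\nCritique: {critique}\n"
        "Produce the final, corrected plan."
    )
```

Passing the models in as plain callables keeps the sketch independent of any particular inference stack; in practice `student_generate` and `teacher_generate` would wrap whatever serving layer hosts Qwen/Qwen3-8B and the teacher model.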