Abstract
This case study addresses the “Judgment Gap”: the critical difference between an agent’s ability to reason and its ability to implement a solution in a collaborative, dual-control environment. By applying the ATLAS framework to the complex mms_issue tasks in the τ²-bench benchmark, we demonstrate a 24.0% pass@1 rate, a 33% relative improvement over GPT-4.1 and a new state-of-the-art for agent reliability.
24.0% Pass@1 Rate
New state-of-the-art on the τ²-bench mms_issue tasks, a benchmark for dual-control agent reliability.
6x Performance Lift
The teacher-student framework provided a nearly 6x performance improvement over the student-only baseline (4.1% pass@1).
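For reference, pass@1 is the fraction of tasks an agent solves on its single first attempt. A minimal sketch of the computation follows; the function and the task counts are illustrative assumptions, not τ²-bench’s actual harness or task split:

```python
def pass_at_1(first_attempt_passed: list[bool]) -> float:
    """Fraction of tasks whose single first attempt passed the checker.

    One boolean per task: True if the agent's first rollout
    satisfied the task's success condition.
    """
    return sum(first_attempt_passed) / len(first_attempt_passed)

# Illustrative numbers roughly matching the case study's results:
atlas = pass_at_1([True] * 24 + [False] * 76)     # 0.24
baseline = pass_at_1([True] * 4 + [False] * 96)   # 0.04
print(f"lift: {atlas / baseline:.1f}x")           # ~6x over the baseline
```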
The Challenge: The Judgment Gap
Real-world workflows are not static. They are dynamic, dual-control environments where both an AI agent and a human user can act. Standard benchmarks that test pure reasoning in isolation fail to capture the primary source of agent failure: a lack of collaborative judgment. τ²-bench models this challenge by creating scenarios where agent success depends on guiding a user through a series of actions. This moves evaluation from an agent that can do to an agent that can guide.
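To make the dual-control setting concrete, here is a toy interaction loop in which the agent can only talk while the user controls the device. The class, state fields, and troubleshooting steps are hypothetical illustrations, not τ²-bench’s actual API or task content:

```python
from dataclasses import dataclass, field

@dataclass
class DualControlEnv:
    """Toy dual-control environment (hypothetical, not the τ²-bench API).

    The agent can only send messages; only the simulated user can
    mutate device state, so success depends on correct guidance.
    """
    device_rebooted: bool = False
    apn_configured: bool = False
    transcript: list[str] = field(default_factory=list)

    def agent_says(self, message: str) -> None:
        # The agent has no direct control; it can only guide the user.
        self.transcript.append(f"agent: {message}")

    def user_acts(self, action: str) -> None:
        # The user executes actions; ordering matters, so the agent
        # must elicit the right steps in the right sequence.
        self.transcript.append(f"user: {action}")
        if action == "reboot_device":
            self.device_rebooted = True
        elif action == "set_apn" and self.device_rebooted:
            self.apn_configured = True

    def solved(self) -> bool:
        return self.device_rebooted and self.apn_configured

env = DualControlEnv()
env.agent_says("Please reboot your device first.")
env.user_acts("reboot_device")
env.agent_says("Now open settings and update the APN.")
env.user_acts("set_apn")
assert env.solved()  # guidance in the wrong order would fail here
```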
Methodology: Cross-Domain Learning
The ATLAS framework employs a Teacher-Student architecture to instill judgment:
- Teacher Model (Arc-Intelligence/arc-teacher-8b): An expert in a separate, complex domain (mathematics).
- Student Model (Qwen/Qwen3-8B): A generalist model that executes tasks.
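A minimal sketch of how a teacher-student loop of this shape can be wired together; the function names and the draft-critique-revise protocol below are illustrative assumptions, not the published ATLAS implementation:

```python
from typing import Callable

def teacher_guidance(teacher_generate: Callable[[str], str],
                     task: str, draft_plan: str) -> str:
    """Ask the teacher to critique the student's draft plan.

    `teacher_generate` wraps the teacher model (e.g. a client for
    Arc-Intelligence/arc-teacher-8b). Hypothetical protocol.
    """
    prompt = (
        f"Task: {task}\n"
        f"Student plan: {draft_plan}\n"
        "Point out judgment errors (wrong step ordering, missing user "
        "confirmations) and suggest a corrected plan."
    )
    return teacher_generate(prompt)

def solve_with_teacher(student_generate: Callable[[str], str],
                       teacher_generate: Callable[[str], str],
                       task: str) -> str:
    # 1. The student drafts a plan on its own.
    draft = student_generate(f"Plan the steps to resolve: {task}")
    # 2. The teacher, expert in a different domain, critiques the
    #    judgment rather than the domain specifics.
    critique = teacher_guidance(teacher_generate, task, draft)
    # 3. The student revises and executes with the critique in context.
    return student_generate(
        f"Task: {task}\nDraft: {draft}\nCritique: {critique}\n"
        "Produce the final, corrected plan."
    )
```

Passing the models in as plain callables keeps the sketch independent of any particular inference stack; in practice `student_generate` and `teacher_generate` would wrap whatever serving layer hosts Qwen/Qwen3-8B and the teacher model.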