
Architectural Overview

ATLAS implements a hybrid learning architecture that rethinks how AI systems acquire and transfer knowledge. Instead of training separate models for every business domain (which would require large in-domain datasets that rarely exist), we train a Teacher model to master reasoning in mathematics - where logic is verifiable and data is abundant - then transfer those reasoning skills to business problems. This cross-domain transfer means a model trained exclusively on math problems can guide agents through CRM workflows, telecom debugging, or other complex business tasks without domain-specific training. It’s not about teaching facts; it’s about teaching how to think.

The Two-Phase Paradigm

Phase 1: Offline Foundation Training

Offline training establishes deep, generalizable skills through reinforcement learning:
Offline RL Training (24-48 hours)
├── SFT Warmup: Base reasoning capabilities
├── GRPO Training: Adaptive teaching skills
└── Output: Teacher model with foundational knowledge
Key Characteristics:
  • Compute-intensive: Minimum 2 GPUs (1 for vLLM, 1 for training)
  • High-quality teaching examples: ~900 carefully curated adaptive dual-agent demonstrations from Arc-ATLAS-Teach-v0
  • Math-trained foundation: Teacher model trained as expert in reasoning, sequential thinking, and complex problem decomposition
  • Cross-domain transfer: Math-trained reasoning generalizes to debugging, coding, and other analytical tasks
  • One-time cost: Amortized over all deployments

Phase 2: Runtime Continual Learning (SDK)

The atlas-sdk runtime adapts pre-trained teachers to specific tasks between GRPO training runs:
Runtime Loop (continuous)
├── Task Analysis: Identify performance gaps via rewards
├── Experimentation: Adjust teaching prompts and strategies
├── Trace Export: Capture high-signal interactions
└── Output: Data for the next GRPO cycle + incremental runtime improvements
Key Characteristics:
  • Lightweight: Runs through managed APIs in the SDK
  • Rapid: Improves over hours instead of full retraining cycles
  • Safe: Maintains non-degradation guarantee via reward guardrails
  • Continuous: Feeds fresh traces into the next offline training job
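The runtime loop above can be sketched in a few lines of Python. All names here are illustrative, not the actual atlas-sdk API; the sketch only shows the shape of the loop, including the reward guardrail that enforces the non-degradation guarantee:

```python
# Illustrative sketch of the runtime continual-learning loop.
# `teacher`, `student`, and `reward_fn` are hypothetical stand-ins,
# not the real atlas-sdk interfaces.

def runtime_loop(teacher, student, tasks, reward_fn, min_gain=0.0):
    """One pass of task analysis, experimentation, and trace export."""
    exported_traces = []
    for task in tasks:
        # Task analysis: measure the performance gap via rewards.
        baseline = reward_fn(student.solve(task))
        guided = reward_fn(student.solve(task, guidance=teacher.teach(task)))
        gain = guided - baseline
        # Guardrail: only export interactions where teaching helped,
        # so the next GRPO cycle never trains on degrading guidance.
        if gain > min_gain:
            exported_traces.append({"task": task, "gain": gain})
    return exported_traces  # feeds the next offline GRPO training job
```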

Technical Implementation

Offline Training Pipeline

The offline phase uses GRPO (Group Relative Policy Optimization) with the following objective:
import torch
import torch.nn.functional as F

def grpo_loss(logits, rewards, reference_logits, beta=0.04):
    """
    GRPO loss combining reward maximization with a KL constraint.

    Args:
        logits: Current policy outputs
        rewards: Performance improvements from teaching
        reference_logits: SFT baseline outputs
        beta: KL divergence coefficient
    """
    policy_logprobs = F.log_softmax(logits, dim=-1)
    reference_logprobs = F.log_softmax(reference_logits, dim=-1)

    # Reward-weighted policy gradient
    pg_loss = -(rewards * policy_logprobs).mean()

    # KL divergence constraint; log_target=True because both
    # arguments are log-probabilities
    kl_loss = F.kl_div(policy_logprobs, reference_logprobs,
                       reduction='batchmean', log_target=True)

    return pg_loss + beta * kl_loss
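The "group relative" part of GRPO refers to normalizing each sampled completion's reward against the other completions for the same prompt, rather than against a learned value baseline. A minimal stdlib-only sketch of that advantage computation (illustrative; the actual implementation operates on batched tensors):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    rewards: list of scalar rewards, one per sampled completion.
    Returns advantages with zero mean per group, so completions are
    rewarded only for beating their siblings, not for absolute score.
    """
    m = mean(rewards)
    s = pstdev(rewards)
    return [(r - m) / (s + eps) for r in rewards]
```

Because advantages are centered within each group, a uniformly easy prompt (all rewards high) contributes no gradient signal; only relative quality differences do.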

Runtime Continual Learning Loop (SDK)

In production, the atlas-sdk runtime constantly evaluates teacher guidance, captures deltas, and exports high-signal traces. Those traces feed both short-term adjustments (prompt/runtime tuning) and the next GRPO training cycle. As a result, the hybrid system keeps improving without pausing deployment.
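A high-signal trace export might look like the following sketch. The field names are hypothetical (the actual atlas-sdk export schema may differ); only the JSONL format and the `traces/runtime.jsonl` path, which the offline pipeline consumes below, are taken from this document:

```python
import json
import os

# Hypothetical trace record; real atlas-sdk export fields may differ.
trace = {
    "task": "Diagnose dropped calls after a firmware rollout",
    "teacher_guidance": "Isolate the recent change, then test each subsystem in order.",
    "student_response": "Rolled back firmware on the affected cell; calls recovered.",
    "reward_delta": 0.31,  # improvement of the guided run over the unguided baseline
}

os.makedirs("traces", exist_ok=True)
with open("traces/runtime.jsonl", "a") as f:  # JSONL: one record per line
    f.write(json.dumps(trace) + "\n")
```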

Empirical Validation

Performance Comparison

| Training Approach | Time to Deploy | Performance Gain | Cost | Generalization |
|---|---|---|---|---|
| Fine-tuning | 1-2 weeks | +10-15% | $1000s | Poor |
| Few-shot prompting | Minutes | +3-5% | ~$1 | Limited |
| ATLAS Runtime + GRPO* | Hours (depends on export + training) | +15.7% baseline | API + GPU spend | Excellent |

*With pre-trained teacher models
Baseline figures reflect the runtime dual-agent loop (student + verifying teacher) plus GRPO training. Online continual learning now lives in the atlas-sdk runtime.

Case Study: Validated Cross-Domain Transfer

Our approach’s effectiveness is validated across multiple benchmarks.

Mathematics → Telecom (τ²-bench):
  • Teacher trained only on math problems (Arc-ATLAS-Teach-v0)
  • Applied to telecom troubleshooting without any telecom training
  • Result: 24.0% pass@1 (vs 18.0% for GPT-4.1 and Claude 3.7)
Mathematics → CRM (CRMArena-Pro):
  • Same math-trained teacher
  • Applied to policy compliance tasks
  • Result: 54% task completion (vs ~35% for leading models)
The key insight: The Teacher learns problem decomposition and systematic reasoning from mathematics, skills that transfer universally to any domain requiring analytical thinking.

Theoretical Foundation: Cross-Domain Learning

The Revolutionary Insight

Traditional approaches require massive datasets for every business domain - data that rarely exists. Our breakthrough is teaching an agent the foundational skill of reasoning itself using mathematics, where logic principles are clear and data is abundant, then transferring that skill to solve any business problem. This cross-domain learning addresses the fundamental constraint in enterprise AI: the scarcity of high-quality, in-domain preference data for complex business tasks.

Why Mathematics as the Foundation?

Mathematics was chosen deliberately as the training domain because:
  • Clear correctness: Unlike business tasks, math has verifiable ground truth
  • Abundant data: Thousands of well-structured problems available
  • Pure reasoning: Requires systematic thinking, problem decomposition, and logical flow
  • Complexity gradient: From simple arithmetic to AIME-level competition problems
Our Teacher model trained on ~7,000 math problems achieved 46% accuracy on AIME-25, placing it at top-10 SOTA level and validating its deep reasoning capabilities.

The Cross-Domain Transfer Mechanism

The magic happens when this math-trained reasoning transfers to business domains:
  1. Fundamental Skills Transfer: Problem decomposition, logical sequencing, and systematic thinking learned in math apply universally
  2. Domain-Agnostic Reasoning: The Teacher generates “thinking traces” - step-by-step reasoning guides that work regardless of domain
  3. No Domain Fine-tuning Required: The Student agent uses these traces without needing business-specific training
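A minimal sketch of how a domain-agnostic thinking trace is consumed (all names and the trace wording are hypothetical; the real student/teacher interfaces live in atlas-sdk):

```python
# Hypothetical illustration: the same reasoning template applies to any domain.
THINKING_TRACE = [
    "1. Restate the goal and list the known constraints.",
    "2. Decompose the problem into independent sub-problems.",
    "3. Solve each sub-problem, checking it against the constraints.",
    "4. Combine partial results and verify the final answer.",
]

def build_student_prompt(task: str, trace=THINKING_TRACE) -> str:
    """Prepend the teacher's reasoning guide to a task from any domain."""
    steps = "\n".join(trace)
    return f"Follow this reasoning guide:\n{steps}\n\nTask: {task}"

# The same trace works for a math proof, a CRM policy check, or, here,
# a telecom ticket - no business-specific fine-tuning involved.
prompt = build_student_prompt("A customer reports intermittent SMS delivery failures.")
```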

Empirical Proof of Transfer

Our results demonstrate unprecedented cross-domain transfer:
  • Math → CRM: 54% task completion on CRMArena-Pro (vs ~35% for leading models)
  • Math → Telecom: 24% pass@1 on τ²-bench (vs 18% for GPT-4.1 and Claude)
  • Critical Accuracy: 69.2% accuracy identifying policy violations when present
This proves that deep reasoning skills learned in mathematics create a universal problem-solving capability.

Compounding Intelligence

The hybrid architecture enables “Compounding Intelligence” through:
  1. Skill Accumulation: Each task creates reusable knowledge
  2. Transfer Learning: Skills generalize to related problems
  3. Continuous Improvement: Performance increases with deployment

Implementation Guide

Setting Up Hybrid Training

Step 1: Offline Foundation

Train or download pre-trained teacher models:
# Option 1: Use pre-trained
huggingface-cli download Arc-Intelligence/ATLAS-8B-Thinking

# Option 2: Train custom (minimum 2 GPUs)
scripts/launch.sh 2 configs/run/teacher_sft.yaml
scripts/launch_with_server.sh 1 1 configs/run/teacher_rcl.yaml
Step 2: Run GRPO on Exported Traces

Launch the one-touch offline pipeline once the SDK has exported traces:
python scripts/run_offline_pipeline.py \
  --export-path traces/runtime.jsonl \
  --wandb-project atlas-production \
  --wandb-run-name runtime-to-grpo
Step 3: Deploy Enhanced Model

Point the SDK runtime at the new teacher checkpoint and redeploy:
# atlas-sdk runtime config excerpt
teacher:
  llm:
    provider: huggingface
    model: /models/atlas-teacher-grpo
    temperature: 0.2

Advantages Over Alternatives

vs. Pure Online Learning

  • More stable: Offline foundation prevents catastrophic forgetting
  • More efficient: Reuses learned skills across tasks
  • More general: Transfers to unseen domains

vs. Pure Offline Training

  • More adaptive: Quickly specializes for new tasks
  • Lower cost: Minimal compute for deployment
  • Continuous improvement: Learns from production data
