| Harness | Script | Primary Questions | Key Artifacts |
|---|---|---|---|
| Learning | scripts/eval_learning.py | Are playbooks improving reward and execution modes for each learning_key? | results/learning/*.json, *.md, index.json |
| Runtime | scripts/eval_dual_agent_models.py | Which student/teacher pairings deliver the best reward vs latency? | results/dual_agent_eval.json |
| Reward | scripts/eval_reward_models.py | How do judge stacks compare on reward, uncertainty, and escalation? | results/reward/*.json |
## Before You Run the Harnesses

- Ensure Postgres persistence is enabled (`storage.database_url`) so sessions, discovery runs, and learning registry entries are available.
- Seed your `.env` with the model API keys required by each harness (see the sketch below). The scripts call `load_dotenv_if_available()` before executing.
- Review gating defaults to `approved` sessions only. Approve runs via `arc-atlas review`, or override the filters with the CLI options documented below when you intentionally include pending data.
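A minimal `.env` sketch, assuming the default OpenAI-backed preset; the exact key names depend on which providers your configs reference, and the override value format depends on your deployment, so treat every entry as a placeholder:

```bash
# .env — loaded by load_dotenv_if_available() before each harness executes.
# Key names and values below are illustrative placeholders, not a canonical list.
OPENAI_API_KEY=sk-...              # assumed key for the default openai_agent.yaml preset
ATLAS_MODEL_TIMEOUT=120            # optional: raise/lower call timeouts globally
ATLAS_MODEL_OVERRIDE_GPT_4O=...    # optional: redirect this preset to an alternate endpoint
                                   # (the GPT_4O suffix is a hypothetical model ID)
```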
## Learning Evaluation Snapshot

Use this harness to validate that runtime playbooks are trending in the right direction. Key options:

- `--learning-key <id>` – focus on explicit learning keys instead of the most recent ones.
- `--filter-project`, `--filter-task`, `--filter-tag tenant:demo` – scope by metadata saved in session telemetry.
- `--summary-only` – skip per-session telemetry fetches for fast CI checks.
- `--compare-to results/learning/index.json` – compute deltas against a previous run.
- `--no-markdown` – emit JSON only for automation.
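For example, a scoped snapshot for a single tenant might look like the following sketch (the learning key and tag values are illustrative):

```bash
# Focus on one learning key scoped to the demo tenant, writing artifacts
# to a dedicated folder. "checkout_flow" is a hypothetical learning key.
python scripts/eval_learning.py \
  --learning-key checkout_flow \
  --filter-tag tenant:demo \
  --output-dir results/learning
```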
Each run reports:

- Reward momentum (recent vs baseline means, deltas, uncertainty).
- Execution-mode distribution (`auto`, `paired`, `coach`, `escalate`) per window.
- Review status counts so you can spot unreviewed spikes.
- Model breakdown (student/teacher session counts and rewards) extracted from adapter telemetry.
- Discovery references tying each learning key back to the onboarding metadata in `discovery_runs`.
Each run also writes an `index.json` manifest you can check into CI artifacts or use as input for comparison runs.
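In CI you can feed the previous run's manifest back into `--compare-to` for a fast delta check; a sketch, assuming the prior manifest was archived at the path shown:

```bash
# Fast CI check: skip per-session fetches, emit JSON only, and compute
# deltas against the previous run's manifest.
python scripts/eval_learning.py \
  --summary-only \
  --no-markdown \
  --compare-to results/learning/index.json
```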
## Dual-Agent Runtime Evaluation

Benchmark student/teacher combinations against the synthetic runtime dataset. Key options:

- `--base-config` – clone a different runtime config (default: `configs/examples/openai_agent.yaml`).
- `--concurrency` – process-pool fan-out for faster sweeps; keep it at 1 when troubleshooting logging.
- `--output` – JSON summary combining per-run records with aggregated stats.
- `ATLAS_MODEL_OVERRIDE_<MODEL_ID>` – redirect presets to alternate endpoints (e.g., Azure-hosted GPT).
- `ATLAS_MODEL_TIMEOUT` – raise or lower call timeouts globally when endpoints are slow.
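A typical sweep using the documented flags (the concurrency value is illustrative; the output path mirrors the artifact listed in the table above):

```bash
# Benchmark student/teacher pairings with a process pool of 4 workers.
python scripts/eval_dual_agent_models.py \
  --base-config configs/examples/openai_agent.yaml \
  --concurrency 4 \
  --output results/dual_agent_eval.json
```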
The output captures:

- Per-task final answers, adaptive-mode history, reward, runtime, and failure flag.
- Aggregated reward averages, latency means, failure counts, and mode distributions per model pair.
- “Best pair” heuristic for quick default selection plus raw telemetry for deeper analysis.
## Reward Judge Evaluation

Replay captured session trajectories to compare judge stacks without disturbing the orchestrator. Key options:

- `--dataset` – supply additional trajectories collected with `scripts.capture_reward_trajectories`.
- `--repeats` – multiple passes to quantify variance.
- `--concurrency` – concurrent judge evaluations per combo.
- `--baseline` – reference stack used for deltas and correlation metrics.
- `--collect-audit` – include serialized judge responses and Markdown summaries alongside the JSON artifact.
- `--config configs/eval/reward_system.yaml` – point at the editable preset file if you customise judge combos without changing code (falls back to in-repo defaults).
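A representative comparison run; the dataset path and repeat count are illustrative:

```bash
# Replay captured trajectories through each judge combo three times and
# keep the audit payloads for later review. The dataset path is hypothetical.
python scripts/eval_reward_models.py \
  --config configs/eval/reward_system.yaml \
  --dataset results/reward/trajectories.jsonl \
  --repeats 3 \
  --collect-audit
```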
The summary reports:

- Reward mean, standard deviation, and uncertainty per combo.
- Escalation rate and failure count (helps spot brittle judge stacks).
- Latency statistics (average, median, p95).
- Agreement signals: delta vs baseline and Pearson correlation.
Audit payloads and Markdown summaries are only written when `--collect-audit` is supplied. The Markdown files highlight per-judge behaviour and link directly to the captured audit payloads, making it easier to review outliers during incident postmortems.
## Operational Tips

- Version control – treat the JSON outputs as experimental telemetry; archive them in object storage or attach them to CI artifacts rather than committing them to the repo.
- Review gating – align harness filters with your review policy. Learning evaluation defaults to all sessions, but your pipeline should mimic the `arc-atlas export` filters you plan to use for training.
- Automation – add these scripts to nightly jobs. Make the output directory configurable via `--output-dir` (learning) so each pipeline run lands in its own folder; see the sketch after this list.
- Judge preset config – copy or edit `configs/eval/reward_system.yaml` to curate judge combos, latency budgets, or model overrides without touching `scripts/eval_reward_models.py`. The CLI picks up your file automatically when passed via `--config`.
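A minimal nightly-job sketch that gives each run its own dated folder (the directory layout is illustrative):

```bash
# Nightly learning snapshot: one dated output folder per pipeline run.
RUN_DIR="results/learning/$(date +%Y-%m-%d)"
python scripts/eval_learning.py --summary-only --no-markdown --output-dir "$RUN_DIR"
```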
## Related Reading

- Learning System Architecture – how playbooks are synthesized and stored.
- Runtime Safety & Review – guardrails that influence which sessions enter the harnesses.
- Database Schema – table-level reference for the telemetry each harness pulls.