Atlas ships three lightweight harnesses so you can quantify runtime performance and learning momentum without building custom analytics. Each harness reads directly from persisted telemetry, emits machine-readable artifacts, and mirrors the metrics the Atlas team uses internally.
| Harness | Script | Primary Questions | Key Artifacts |
| --- | --- | --- | --- |
| Learning | scripts/eval_learning.py | Are playbooks improving reward and execution modes for each learning_key? | results/learning/*.json, *.md, index.json |
| Runtime | scripts/eval_dual_agent_models.py | Which student/teacher pairings deliver the best reward vs latency? | results/dual_agent_eval.json |
| Reward | scripts/eval_reward_models.py | How do judge stacks compare on reward, uncertainty, and escalation? | results/reward/*.json |

Before You Run the Harnesses

  • Ensure Postgres persistence is enabled (storage.database_url) so sessions, discovery runs, and learning registry entries are available.
  • Seed your .env with the model API keys required by each harness; the scripts call load_dotenv_if_available() before executing (a pre-flight sketch follows this list).
  • Review gating defaults to approved sessions only. Approve runs via arc-atlas review, or override the filters with the CLI options documented below when you intentionally want to include pending data.
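If you want to fail fast before launching a sweep, a small pre-flight script along the lines below can confirm the first two prerequisites. It is a sketch only: the psycopg2 and python-dotenv dependencies, the DATABASE_URL fallback, and the REQUIRED_KEYS names are assumptions to adapt to your environment, not part of the harnesses.
# Hedged pre-flight sketch: checks Postgres reachability and API keys before a run.
# Assumes psycopg2 and python-dotenv are installed; adjust the key names to your providers.
import os
import sys

import psycopg2
from dotenv import load_dotenv

load_dotenv()  # the harnesses load .env themselves via load_dotenv_if_available()

REQUIRED_KEYS = ["OPENAI_API_KEY", "GEMINI_API_KEY"]  # illustrative names only


def main() -> int:
    missing = [key for key in REQUIRED_KEYS if not os.getenv(key)]
    if missing:
        print(f"Missing API keys: {', '.join(missing)}")
        return 1
    dsn = os.getenv("DATABASE_URL", "postgresql://atlas:atlas@localhost:5433/atlas")
    try:
        psycopg2.connect(dsn).close()  # cheap connectivity check
    except psycopg2.OperationalError as exc:
        print(f"Postgres not reachable at {dsn}: {exc}")
        return 1
    print("Prerequisites look good.")
    return 0


if __name__ == "__main__":
    sys.exit(main())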

Learning Evaluation Snapshot

Use this harness to validate that runtime playbooks are trending in the right direction.
python scripts/eval_learning.py \
  --database-url postgresql://atlas:atlas@localhost:5433/atlas \
  --recent-window 10 \
  --baseline-window 50 \
  --limit 5
Key options
  • --learning-key <id> – focus on explicit learning keys instead of the most recent ones.
  • --filter-project, --filter-task, --filter-tag tenant:demo – scope by metadata saved in session telemetry.
  • --summary-only – skip per-session telemetry fetches for fast CI checks.
  • --compare-to results/learning/index.json – compute deltas against a previous run.
  • --no-markdown – emit JSON only for automation.
Metrics reported
  • Reward momentum (recent vs baseline means, deltas, uncertainty).
  • Execution-mode distribution (auto, paired, coach, escalate) per window.
  • Review status counts so you can spot unreviewed spikes.
  • Model breakdown (student/teacher session counts and rewards) extracted from adapter telemetry.
  • Discovery references tying each learning key back to the onboarding metadata in discovery_runs.
The harness produces per-key JSON + Markdown files and an index.json manifest you can check into CI artifacts or use as input for comparison runs.
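If you feed index.json into a comparison or CI gate, a short script along these lines can flag regressing keys. The field names (learning_keys, learning_key, recent_reward_mean, baseline_reward_mean) are guesses at the artifact schema for illustration; inspect a real manifest and adjust before relying on it.
# Hedged sketch: flag learning keys whose recent reward trails the baseline.
import json
from pathlib import Path

index = json.loads(Path("results/learning/index.json").read_text())
for entry in index.get("learning_keys", []):  # assumed top-level list of per-key summaries
    key = entry.get("learning_key", "<unknown>")
    recent = entry.get("recent_reward_mean")
    baseline = entry.get("baseline_reward_mean")
    if recent is None or baseline is None:
        continue  # skip keys without enough sessions in either window
    delta = recent - baseline
    status = "improving" if delta >= 0 else "REGRESSING"
    print(f"{key}: recent={recent:.3f} baseline={baseline:.3f} delta={delta:+.3f} ({status})")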

Dual-Agent Runtime Evaluation

Benchmark student/teacher combinations against the synthetic runtime dataset.
python scripts/eval_dual_agent_models.py \
  --dataset atlas/data/synthetic_runtime_tasks.jsonl \
  --student-models claude-haiku-4-5 gemini-2.5-flash \
  --teacher-models grok-4-fast gemini-2.5-pro \
  --repeats 2 \
  --output results/dual_agent_eval.json
Key options
  • --base-config – clone a different runtime config (default: configs/examples/openai_agent.yaml).
  • --concurrency – process pool fan-out for faster sweeps; keep it at 1 when troubleshooting logging.
  • --output – JSON summary combining per-run records with aggregated stats.
  • ATLAS_MODEL_OVERRIDE_<MODEL_ID> – redirect presets to alternate endpoints (e.g., Azure-hosted GPT).
  • ATLAS_MODEL_TIMEOUT – raise or lower call timeouts globally when endpoints are slow.
Metrics reported
  • Per-task final answers, adaptive-mode history, reward, runtime, and failure flag.
  • Aggregated reward averages, latency means, failure counts, and mode distributions per model pair.
  • “Best pair” heuristic for quick default selection plus raw telemetry for deeper analysis.
Use this harness whenever you refresh preferred providers or want evidence before promoting a new default to production.
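Once a sweep finishes, you will usually want to rank the pairs yourself rather than rely on the built-in heuristic alone. A sketch along these lines reads the summary and sorts by reward, breaking ties on latency; the field names (aggregates, student, teacher, reward_mean, latency_mean, failure_count) are assumptions about the JSON layout, so check them against an actual results/dual_agent_eval.json first.
# Hedged sketch: rank student/teacher pairs by mean reward, then by latency.
import json
from pathlib import Path

summary = json.loads(Path("results/dual_agent_eval.json").read_text())
pairs = summary.get("aggregates", [])  # assumed list of per-pair aggregate stats
ranked = sorted(
    pairs,
    key=lambda p: (-p.get("reward_mean", 0.0), p.get("latency_mean", float("inf"))),
)
for pair in ranked:
    print(
        f"{pair.get('student')} + {pair.get('teacher')}: "
        f"reward={pair.get('reward_mean', 0.0):.3f}, "
        f"latency={pair.get('latency_mean', 0.0):.1f}s, "
        f"failures={pair.get('failure_count', 0)}"
    )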

Reward Judge Evaluation

Replay captured session trajectories to compare judge stacks without disturbing the orchestrator.
python scripts/eval_reward_models.py \
  --dataset atlas/data/reward_eval_trajectories.jsonl \
  --judge-combos gemini_pair claude_stack grok_stack \
  --baseline gemini_pair \
  --collect-audit \
  --output results/reward/latest.json
Key options
  • --dataset – supply additional trajectories collected with scripts.capture_reward_trajectories.
  • --repeats – multiple passes to quantify variance.
  • --concurrency – concurrent judge evaluations per combo.
  • --baseline – reference stack used for deltas and correlation metrics.
  • --collect-audit – include serialized judge responses and Markdown summaries alongside the JSON artifact.
  • --config configs/eval/reward_system.yaml – point at the editable preset file if you customise judge combos without changing code (falls back to in-repo defaults).
Metrics reported
  • Reward mean, standard deviation, and uncertainty per combo.
  • Escalation rate and failure count (helps spot brittle judge stacks).
  • Latency statistics (average, median, p95).
  • Agreement signals: delta vs baseline and Pearson correlation.
The harness now emits both JSON and Markdown reports when --collect-audit is supplied. The Markdown files highlight per-judge behaviour and link directly to captured audit payloads, making it easier to review outliers during incident postmortems.
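For a quick look at how the combos stack up without opening the Markdown reports, something like the following works; the keys (combos, name, reward_mean, reward_std, escalation_rate, delta_vs_baseline) are illustrative assumptions about the JSON schema rather than a documented contract.
# Hedged sketch: one-line summary per judge combo from the reward harness output.
import json
from pathlib import Path

report = json.loads(Path("results/reward/latest.json").read_text())
for combo in report.get("combos", []):  # assumed list of per-combo aggregates
    print(
        f"{combo.get('name', '<combo>')}: "
        f"reward={combo.get('reward_mean', 0.0):.3f} ± {combo.get('reward_std', 0.0):.3f}, "
        f"escalation={combo.get('escalation_rate', 0.0):.1%}, "
        f"delta_vs_baseline={combo.get('delta_vs_baseline', 0.0):+.3f}"
    )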

Operational Tips

  • Version control – treat the JSON outputs as experimental telemetry; archive them in object storage or attach to CI artifacts rather than committing to the repo.
  • Review gating – align harness filters with your review policy. Learning evaluation defaults to all sessions, but your pipeline should mimic the arc-atlas export filters you plan to use for training.
  • Automation – add these scripts to nightly jobs. Make the output directory configurable via --output-dir (learning) so each pipeline run lands in its own folder (see the wrapper sketch after this list).
  • Judge preset config – copy or edit configs/eval/reward_system.yaml to curate judge combos, latency budgets, or model overrides without touching scripts/eval_reward_models.py. The CLI will pick up your file automatically when passed via --config.
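A minimal nightly wrapper might look like the sketch below: it runs the learning and reward harnesses with a dated output location so artifacts never collide. The dated directory layout is an assumption, and the flags simply mirror the examples above; add the dual-agent sweep the same way if you want it in the rotation.
# Hedged sketch of a nightly job: dated output paths plus the CLI flags shown above.
import subprocess
from datetime import date
from pathlib import Path

stamp = date.today().isoformat()
out_dir = Path(f"results/nightly/{stamp}")
out_dir.mkdir(parents=True, exist_ok=True)

subprocess.run(
    [
        "python", "scripts/eval_learning.py",
        "--database-url", "postgresql://atlas:atlas@localhost:5433/atlas",
        "--summary-only",
        "--output-dir", str(out_dir / "learning"),
    ],
    check=True,
)
subprocess.run(
    [
        "python", "scripts/eval_reward_models.py",
        "--dataset", "atlas/data/reward_eval_trajectories.jsonl",
        "--judge-combos", "gemini_pair", "claude_stack", "grok_stack",
        "--baseline", "gemini_pair",
        "--output", str(out_dir / "reward.json"),
    ],
    check=True,
)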