| Harness | Script | Primary Questions | Key Artifacts |
|---|---|---|---|
| Learning | scripts/eval_learning.py | Are playbooks improving reward and execution modes for each learning_key? | results/learning/*.json, *.md, index.json |
| Runtime | scripts/eval_dual_agent_models.py | Which student/teacher pairings deliver the best reward vs latency? | results/dual_agent_eval.json |
| Reward | scripts/eval_reward_models.py | How do judge stacks compare on reward, uncertainty, and escalation? | results/reward/*.json |
## Before You Run the Harnesses

- Ensure Postgres persistence is enabled (`storage.database_url`) so sessions, discovery runs, and learning registry entries are available.
- Seed your `.env` with the model API keys required by each harness (see the sketch below). The scripts call `load_dotenv_if_available()` before executing.
- Review gating defaults to `approved` sessions only. Approve runs via `arc-atlas review`, or override the filters with the CLI options documented below when you intentionally include pending data.
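A minimal `.env` sketch, assuming the default OpenAI-backed preset; the exact key names depend on which providers your configs reference, and the override value format depends on your deployment, so treat every entry as a placeholder:

```bash
# .env — loaded by load_dotenv_if_available() before each harness executes.
# Key names and values below are illustrative placeholders, not a canonical list.
OPENAI_API_KEY=sk-...              # assumed key for the default openai_agent.yaml preset
ATLAS_MODEL_TIMEOUT=120            # optional: raise/lower call timeouts globally
ATLAS_MODEL_OVERRIDE_GPT_4O=...    # optional: redirect this preset to an alternate endpoint
                                   # (the GPT_4O suffix is a hypothetical model ID)
```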
## Learning Evaluation Snapshot

Use this harness to validate that runtime playbooks are trending in the right direction. Key options:

- `--learning-key <id>` – focus on explicit learning keys instead of the most recent ones.
- `--filter-project`, `--filter-task`, `--filter-tag tenant:demo` – scope by metadata saved in session telemetry.
- `--summary-only` – skip per-session telemetry fetches for fast CI checks.
- `--compare-to results/learning/index.json` – compute deltas against a previous run.
- `--no-markdown` – emit JSON only for automation.
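For example, a scoped snapshot for a single tenant might look like the following sketch (the learning key and tag values are illustrative):

```bash
# Focus on one learning key scoped to the demo tenant, writing artifacts
# to a dedicated folder. "checkout_flow" is a hypothetical learning key.
python scripts/eval_learning.py \
  --learning-key checkout_flow \
  --filter-tag tenant:demo \
  --output-dir results/learning
```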
Each run reports:

- Reward momentum (recent vs baseline means, deltas, uncertainty).
- Execution-mode distribution (`auto`, `paired`, `coach`, `escalate`) per window.
- Review status counts so you can spot unreviewed spikes.
- Model breakdown (student/teacher session counts and rewards) extracted from adapter telemetry.
- Discovery references tying each learning key back to the onboarding metadata in `discovery_runs`.
Each run also writes an `index.json` manifest you can check into CI artifacts or use as input for comparison runs.
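In CI you can feed the previous run's manifest back into `--compare-to` for a fast delta check; a sketch, assuming the prior manifest was archived at the path shown:

```bash
# Fast CI check: skip per-session fetches, emit JSON only, and compute
# deltas against the previous run's manifest.
python scripts/eval_learning.py \
  --summary-only \
  --no-markdown \
  --compare-to results/learning/index.json
```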
## Dual-Agent Runtime Evaluation

Benchmark student/teacher combinations against the synthetic runtime dataset. Key options:

- `--base-config` – clone a different runtime config (default: `configs/examples/openai_agent.yaml`).
- `--concurrency` – process-pool fan-out for faster sweeps; keep it at 1 when troubleshooting logging.
- `--output` – JSON summary combining per-run records with aggregated stats.
- `ATLAS_MODEL_OVERRIDE_<MODEL_ID>` – redirect presets to alternate endpoints (e.g., Azure-hosted GPT).
- `ATLAS_MODEL_TIMEOUT` – raise or lower call timeouts globally when endpoints are slow.
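A typical sweep using the documented flags (the concurrency value is illustrative; the output path mirrors the artifact listed in the table above):

```bash
# Benchmark student/teacher pairings with a process pool of 4 workers.
python scripts/eval_dual_agent_models.py \
  --base-config configs/examples/openai_agent.yaml \
  --concurrency 4 \
  --output results/dual_agent_eval.json
```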
The output captures:

- Per-task final answers, adaptive-mode history, reward, runtime, and failure flag.
- Aggregated reward averages, latency means, failure counts, and mode distributions per model pair.
- “Best pair” heuristic for quick default selection plus raw telemetry for deeper analysis.
## Reward Judge Evaluation

Replay captured session trajectories to compare judge stacks without disturbing the orchestrator. Key options:

- `--dataset` – supply additional trajectories collected with `scripts.capture_reward_trajectories`.
- `--repeats` – multiple passes to quantify variance.
- `--concurrency` – concurrent judge evaluations per combo.
- `--baseline` – reference stack used for deltas and correlation metrics.
- `--collect-audit` – include serialized judge responses and Markdown summaries alongside the JSON artifact.
- `--config configs/eval/reward_system.yaml` – point at the editable preset file if you customise judge combos without changing code (falls back to in-repo defaults).
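A representative comparison run; the dataset path and repeat count are illustrative:

```bash
# Replay captured trajectories through each judge combo three times and
# keep the audit payloads for later review. The dataset path is hypothetical.
python scripts/eval_reward_models.py \
  --config configs/eval/reward_system.yaml \
  --dataset results/reward/trajectories.jsonl \
  --repeats 3 \
  --collect-audit
```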
The summary reports:

- Reward mean, standard deviation, and uncertainty per combo.
- Escalation rate and failure count (helps spot brittle judge stacks).
- Latency statistics (average, median, p95).
- Agreement signals: delta vs baseline and Pearson correlation.
Audit payloads and Markdown summaries are only written when `--collect-audit` is supplied. The Markdown files highlight per-judge behaviour and link directly to the captured audit payloads, making it easier to review outliers during incident postmortems.
## Operational Tips

- Version control – treat the JSON outputs as experimental telemetry; archive them in object storage or attach them to CI artifacts rather than committing them to the repo.
- Review gating – align harness filters with your review policy. Learning evaluation defaults to all sessions, but your pipeline should mimic the `arc-atlas export` filters you plan to use for training.
- Automation – add these scripts to nightly jobs. Make the output directory configurable via `--output-dir` (learning) so each pipeline run lands in its own folder; see the sketch after this list.
- Judge preset config – copy or edit `configs/eval/reward_system.yaml` to curate judge combos, latency budgets, or model overrides without touching `scripts/eval_reward_models.py`. The CLI picks up your file automatically when passed via `--config`.
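A minimal nightly-job sketch that gives each run its own dated folder (the directory layout is illustrative):

```bash
# Nightly learning snapshot: one dated output folder per pipeline run.
RUN_DIR="results/learning/$(date +%Y-%m-%d)"
python scripts/eval_learning.py --summary-only --no-markdown --output-dir "$RUN_DIR"
```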
## Related Reading

- Learning System Architecture – how playbooks are synthesized and stored.
- Runtime Safety & Review – guardrails that influence which sessions enter the harnesses.
- Database Schema – table-level reference for the telemetry each harness pulls.