| Harness | SDK Script | Primary Questions | Key Artifacts |
|---|---|---|---|
| Learning snapshot | `scripts/report_learning.py` | Are playbooks improving reward and execution modes for each `learning_key`? | `atlas-sdk/results/learning/*.json`, `*.md`, `index.json` |
| Runtime benchmarking | `scripts/benchmark_dual_agent_models.py` | Which student/teacher pairings deliver the best reward vs latency? | `atlas-sdk/results/dual_agent_eval.json` |
| Reward benchmarking | `scripts/benchmark_reward_models.py` | How do judge stacks compare on reward, uncertainty, and escalation? | `atlas-sdk/results/reward/*.json` (+ optional Markdown) |
## Before You Run the Harnesses
- Enable Postgres persistence in your SDK config (`storage.database_url`) so the scripts can read sessions, discovery runs, and learning registry entries.
- Load `.env` with the provider API keys required by each harness; the SDK scripts call `load_dotenv_if_available()` before executing.
- Review gating defaults to `approved` sessions only. Approve runs via `arc-atlas review`, or override the filters explicitly when you intend to include pending or quarantined data.
- Run the commands from the `atlas-sdk` repo root (for example, `cd ../atlas-sdk` if you keep both repos side by side).
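A quick shell sketch of this checklist; the config path and session-approval step are illustrative, so adapt them to your checkout:

```bash
# Run the harnesses from the atlas-sdk repo root.
cd ../atlas-sdk

# Confirm storage.database_url is set; the example config path is an assumption.
grep -n "database_url" configs/examples/openai_agent.yaml

# The scripts call load_dotenv_if_available(), so provider keys belong in .env.
test -f .env || echo "create .env with your provider API keys first"

# Approve pending sessions so the default approved-only gating includes them.
arc-atlas review
```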
## Learning Snapshot (`atlas-sdk/scripts/report_learning.py`)
Use the reporting harness to evaluate playbook health without injecting hints:
- `--learning-key <id>` – report on explicit keys rather than the most recent ones.
- `--filter-project`, `--filter-task`, `--filter-tag tenant:demo` – scope by telemetry metadata.
- `--summary-only` – skip trajectory fetches for CI.
- `--compare-to results/learning/index.json` – compute deltas against a previous run.
- `--no-markdown` – emit JSON only.
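A minimal invocation sketch combining the flags above; the learning key and tag values are placeholders:

```bash
# "checkout-flow" is a hypothetical learning key; substitute one from your registry.
python scripts/report_learning.py \
  --learning-key checkout-flow \
  --filter-tag tenant:demo \
  --compare-to results/learning/index.json \
  --summary-only
```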
Each report captures:

- Recent vs baseline reward means, deltas, and uncertainty.
- Execution-mode distribution (auto/paired/coach) for each window.
- Review status counts so pending/quarantined sessions surface immediately.
- Model usage breakdown (student/teacher pairings drawn from adapter telemetry).
- Discovery references tying each learning key back to `discovery_runs`.
Artifacts land in `atlas-sdk/results/learning/` (per-key JSON, optional Markdown, and a manifest). For lightweight spot checks inside Atlas Core you can still call `atlas.training_data.get_training_sessions` directly (see the snippet in the Learning System guide), but the SDK harness is the canonical workflow.
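To eyeball the artifacts from the `atlas-sdk` repo root (paths assume the default output directory):

```bash
# List per-key reports, optional Markdown, and the manifest.
ls results/learning/
# Pretty-print the manifest to see which learning keys a run covered.
python -m json.tool results/learning/index.json
```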
## Runtime Benchmarking (`atlas-sdk/scripts/benchmark_dual_agent_models.py`)
Benchmark student/teacher combinations against the synthetic runtime dataset and capture latency + reward deltas:
- `--base-config` – clone a different runtime config (default: `configs/examples/openai_agent.yaml`).
- `--concurrency` – process-pool fan-out for faster sweeps (set to 1 while debugging logging).
- `ATLAS_MODEL_OVERRIDE_<MODEL>` – redirect presets to hosted checkpoints (Azure OpenAI, self-hosted vLLM, etc.).
- `ATLAS_MODEL_TIMEOUT` – raise or lower call timeouts globally when exercising slow providers.
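For example, a sweep against a self-hosted endpoint might look like the sketch below; the `GPT4O` suffix and endpoint URL are hypothetical, so check your provider adapter for the expected override format:

```bash
# Redirect one preset to a self-hosted vLLM endpoint and extend timeouts for the sweep.
ATLAS_MODEL_OVERRIDE_GPT4O="http://vllm.internal:8000/v1" \
ATLAS_MODEL_TIMEOUT=120 \
python scripts/benchmark_dual_agent_models.py \
  --base-config configs/examples/openai_agent.yaml \
  --concurrency 4
```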
Each run records:

- Per-task final answers, adaptive-mode history, reward, runtime, and failure status.
- Aggregated reward averages, latency means, failure counts, and mode distributions per model pair.
- “Best pair” heuristic for quick default selection plus raw telemetry for deeper analysis.
## Reward Benchmarking (`atlas-sdk/scripts/benchmark_reward_models.py`)
Replay captured session trajectories to compare reward judge stacks without disturbing the orchestrator:
- `--dataset` – supply datasets collected with `scripts/collect_reward_trajectories.py`.
- `--repeats` – run multiple passes to quantify variance.
- `--concurrency` – concurrent judge evaluations per combo.
- `--markdown-output` – write Markdown summaries alongside the JSON artifacts.
- `--collect-audit` – include serialized prompts/responses for debugging reward prompts.
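A replay sketch; the dataset path is a placeholder for output from `scripts/collect_reward_trajectories.py`, and exact flag arguments may differ (check `--help`):

```bash
python scripts/benchmark_reward_models.py \
  --dataset results/reward/trajectories.json \
  --repeats 3 \
  --concurrency 2 \
  --collect-audit
```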
Each combo's summary reports:

- Reward mean, standard deviation, and uncertainty.
- Escalation rates, failure counts, and agreement vs the baseline stack.
- Latency statistics (average, median, p95).
## Guarding Reward Schemas Inside Atlas Core
Atlas Core still ships the `tests/test_reward_schema.py` regression suite to ensure the trainer-side configs stay in sync with SDK telemetry. The suite covers:
- Reward config `_target_` paths and prompt references in `configs/reward/` + `configs/trainer/reward/`.
- Schema compatibility with the latest SDK telemetry (per-judge fields, optional blocks, etc.).
- Trainer exports (`trainers/__init__.py`) so downstream imports keep working.
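To exercise the guard locally, run the suite from the Atlas Core repo root (assuming pytest is the repo's test runner):

```bash
# Run only the reward-schema regression suite.
pytest tests/test_reward_schema.py -v
```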
## Related Reading
- Learning System Architecture – how playbooks are synthesized and stored.
- Runtime Safety & Review – guardrails that influence which sessions enter the harnesses.
- Database Schema – table-level reference for the telemetry each harness pulls.