Available Datasets
ATLAS provides curated datasets for training adaptive teachers and evaluating system performance.
Primary Dataset
Arc-ATLAS-Teach-v0
A comprehensive dataset of teaching interactions for RL training, available on Hugging Face.
- Total examples: 100,000+ teaching interactions
- Task domains: Mathematics, reasoning, coding, debugging
- Formats: SFT and RL training splits
- Languages: English
Domain-Specific Subsets
Mathematics Subset
Focus: Step-by-step mathematical reasoning
Code Generation Subset
Focus: Programming tasks and debugging
SRE/Debugging Subset
Focus: System reliability and debugging scenarios
Data Quality Metrics
Coverage Statistics
| Domain | Examples | Avg Length | Unique Patterns |
|---|---|---|---|
| Mathematics | 35,000 | 250 tokens | 500+ |
| Code Generation | 30,000 | 400 tokens | 800+ |
| Reasoning | 25,000 | 300 tokens | 600+ |
| Debugging | 10,000 | 350 tokens | 400+ |
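As a quick sanity check on the table above, the per-domain counts can be summed against the headline total, and a count-weighted mean gives one overall average length. The sketch below only re-derives numbers from the table; the variable names are illustrative, and because the table's averages are presumably rounded, the weighted mean should be read as approximate.

```python
# Per-domain (example count, average length in tokens), copied from the table above.
coverage = {
    "Mathematics": (35_000, 250),
    "Code Generation": (30_000, 400),
    "Reasoning": (25_000, 300),
    "Debugging": (10_000, 350),
}

total_examples = sum(count for count, _ in coverage.values())

# Fraction of the dataset contributed by each domain.
domain_share = {
    name: count / total_examples for name, (count, _) in coverage.items()
}

# Count-weighted mean length: larger domains contribute proportionally more.
weighted_avg_tokens = (
    sum(count * avg for count, avg in coverage.values()) / total_examples
)

print(total_examples)       # 100000
print(weighted_avg_tokens)  # 317.5
for name, share in domain_share.items():
    print(f"{name}: {share:.0%}")
```

Mathematics and Code Generation together account for 65% of the listed examples, which is worth keeping in mind when sampling a balanced evaluation set.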
Performance Baselines
| Metric | Baseline | w/ Dual-Agent Loop | Improvement |
|---|---|---|---|
| Accuracy | 62.3% | 78.0% | +15.7 pp |
| Completion | 69% | 100% | +31 pp |
| Token Efficiency | 100% | 50% | -50% |
These figures reflect the closed-loop runtime plus GRPO baseline. Online continual learning now lives in the atlas-sdk runtime if you need task-specific adaptation between offline training runs.
Creating Custom Datasets
Data Format Requirements
Each record should carry the fields the SDK emits: prompt, student_response, guidance, and rewards.
Preprocessing Pipeline (JSONL exports)
Use the runtime helpers that ship in this repository, in trainers.runtime_dataset, to turn SDK exports into trainer-ready splits.
The helpers produce records, a list of dictionaries that any Hugging Face Dataset can ingest (the same structure Hydra configs consume via custom_data.runtime_trace_data). See the Training Data Pipeline guide for additional filters and batching helpers.
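The repository helpers in trainers.runtime_dataset are not reproduced here. As a stand-in, the sketch below shows the general shape of the step using only the standard library: parse JSONL lines into a list of dictionaries, then split that list deterministically. The function names parse_jsonl and split_records, the 90/10 ratio, and the seed are all hypothetical choices for illustration, not part of the ATLAS API.

```python
import json
import random


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines (e.g. an open file) into a list of dicts.

    Blank lines are skipped; each remaining line must hold one JSON object.
    """
    return [json.loads(line) for line in lines if line.strip()]


def split_records(records, eval_fraction=0.1, seed=0):
    """Shuffle a copy of the records and return (train, eval) lists.

    A fixed seed keeps the split reproducible across runs.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Typical use with an exported file (path is a placeholder):
#   with open("export.jsonl", encoding="utf-8") as handle:
#       records = parse_jsonl(handle)
#   train, held_out = split_records(records)
```

A plain list of dictionaries like this can be handed to datasets.Dataset.from_list if a Hugging Face Dataset object is needed downstream.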
Quality Validation
Inspect coverage with standard Python tooling; you already have datasets installed for training.
Keep the field names (prompt, student_response, guidance, rewards) identical to what the SDK emits so Atlas Core can reuse the traces without custom glue code.
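One way to run that inspection with the standard library alone is sketched below. It assumes records are dictionaries carrying the four fields named above. The optional domain key used for the coverage count, the list-of-floats shape of rewards, and the function names are hypothetical and may not match what the SDK actually emits.

```python
from collections import Counter

# Field names taken from the text above; their value types are assumptions.
REQUIRED_FIELDS = ("prompt", "student_response", "guidance", "rewards")


def missing_fields(record):
    """Return the required fields that are absent or None in one record."""
    return [name for name in REQUIRED_FIELDS if record.get(name) is None]


def coverage_report(records):
    """Count records per domain and collect indices of incomplete records."""
    per_domain = Counter(r.get("domain", "unknown") for r in records)
    incomplete = [i for i, r in enumerate(records) if missing_fields(r)]
    return per_domain, incomplete


# Two made-up records; the second is missing its guidance.
sample = [
    {"prompt": "2 + 2?", "student_response": "4", "guidance": "Show work.",
     "rewards": [1.0], "domain": "mathematics"},
    {"prompt": "Fix the loop.", "student_response": "...", "guidance": None,
     "rewards": [0.5], "domain": "debugging"},
]
per_domain, incomplete = coverage_report(sample)
print(dict(per_domain))  # {'mathematics': 1, 'debugging': 1}
print(incomplete)        # [1]
```

Reporting indices rather than raising on the first bad record makes it easier to review a whole export in one pass before deciding what to drop.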
Contributing Data
We welcome contributions to improve ATLAS datasets:
- Format your data according to the schema
- Validate quality using provided tools
- Test with models to ensure compatibility
- Submit PR with data and documentation