Available Datasets
ATLAS provides curated datasets for training adaptive teachers and evaluating system performance.
Primary Dataset
Arc-ATLAS-Teach-v0
View on Hugging Face
Comprehensive teaching interaction dataset for RL training
- Total examples: 100,000+ teaching interactions
- Task domains: Mathematics, reasoning, coding, debugging
- Formats: SFT and RL training splits
- Languages: English
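To work with a single task domain from the list above, one option is to filter records on a domain label. The following is a minimal sketch, not the dataset's documented interface: it assumes each record is a dictionary carrying a `domain` field, which is a hypothetical name, and it runs on in-memory sample records rather than the published data.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Domain names taken from the "Task domains" line above.
DOMAINS = {"mathematics", "reasoning", "coding", "debugging"}


def filter_by_domain(records: Iterable[Dict], domain: str) -> List[Dict]:
    """Return records whose "domain" field matches (case-insensitive).

    Assumption: "domain" is a hypothetical column name; the real
    Arc-ATLAS-Teach-v0 columns may be named differently.
    """
    wanted = domain.strip().lower()
    if wanted not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    return [r for r in records if str(r.get("domain", "")).lower() == wanted]


def domain_counts(records: Iterable[Dict]) -> Counter:
    """Count records per lowercased domain label."""
    return Counter(str(r.get("domain", "")).lower() for r in records)


# Made-up sample records for illustration only.
sample = [
    {"domain": "mathematics", "prompt": "Solve 2x + 3 = 7"},
    {"domain": "coding", "prompt": "Reverse a linked list"},
    {"domain": "Mathematics", "prompt": "Factor x^2 - 1"},
]

math_only = filter_by_domain(sample, "mathematics")
print(len(math_only))          # 2
print(domain_counts(sample))
```

The same functions accept any iterable of dictionaries, so they should also work on rows streamed from a file once the real field names are substituted.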
Domain-Specific Subsets
Mathematics Subset
Focus: Step-by-step mathematical reasoning
Code Generation Subset
Focus: Programming tasks and debugging
SRE/Debugging Subset
Focus: System reliability and debugging scenarios
Data Quality Metrics
Coverage Statistics
Domain | Examples | Avg Length | Unique Patterns |
---|---|---|---|
Mathematics | 35,000 | 250 tokens | 500+ |
Code Generation | 30,000 | 400 tokens | 800+ |
Reasoning | 25,000 | 300 tokens | 600+ |
Debugging | 10,000 | 350 tokens | 400+ |
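As a quick consistency check on the table above, the per-domain counts can be summed and the average lengths combined into an example-weighted average. This is a small worked calculation using only the figures in the table; because the per-domain averages are presumably rounded, the token total is approximate.

```python
# (domain, examples, average length in tokens), copied from the table above
coverage = [
    ("Mathematics", 35_000, 250),
    ("Code Generation", 30_000, 400),
    ("Reasoning", 25_000, 300),
    ("Debugging", 10_000, 350),
]

total_examples = sum(n for _, n, _ in coverage)
total_tokens = sum(n * avg for _, n, avg in coverage)

# Weight each domain's average length by its number of examples.
weighted_avg = total_tokens / total_examples

# Fraction of the dataset contributed by each domain.
shares = {name: n / total_examples for name, n, _ in coverage}

print(total_examples)  # 100000
print(total_tokens)    # 31750000
print(weighted_avg)    # 317.5
```

The sum of 100,000 matches the "100,000+" total quoted for the primary dataset, and the weighted average of about 317.5 tokens sits above the simple mean of the four averages (325 would be the unweighted figure only if all domains were the same size; here the larger Mathematics share pulls it down).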
Performance Baselines
Metric | Baseline | w/ Teaching | Improvement |
---|---|---|---|
Accuracy | 62.3% | 78.0% | +15.7% |
Completion | 69% | 100% | +31% |
Token Efficiency | 100% | 50% | -50% |
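The Improvement column appears to report absolute differences in percentage points (for example, 78.0 - 62.3 = 15.7). The sketch below recomputes those differences and also derives the relative change against the baseline, which is a different and larger number for the first two rows. Only the table's values are used; the function name is illustrative.

```python
from typing import Tuple

# (baseline, with teaching), in percent, copied from the table above
results = {
    "Accuracy": (62.3, 78.0),
    "Completion": (69.0, 100.0),
    "Token Efficiency": (100.0, 50.0),
}


def changes(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute change in points, relative change in percent)."""
    absolute = after - before
    relative = absolute / before * 100
    return round(absolute, 1), round(relative, 1)


for metric, (before, after) in results.items():
    absolute, relative = changes(before, after)
    print(f"{metric}: {absolute:+.1f} points, {relative:+.1f}% relative")
```

Under this reading, accuracy rises by 15.7 points, which is roughly a 25.2% relative gain over the baseline, while the token figure falls by half on either measure.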
Creating Custom Datasets
Data Format Requirements
Your dataset should follow the ATLAS dataset schema.
Preprocessing Pipeline
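A preprocessing pass for custom data might look roughly like the following. This is a sketch under stated assumptions, not the ATLAS pipeline: the field names `prompt` and `response` are hypothetical stand-ins for whatever the actual schema requires, and the steps shown (whitespace normalization, dropping incomplete records, case-insensitive deduplication) are common choices rather than documented requirements.

```python
import json
import re
from typing import Dict, Iterable, Iterator

# Hypothetical field names; substitute the fields from the real schema.
REQUIRED_FIELDS = ("prompt", "response")


def normalize(text: str) -> str:
    """Collapse runs of whitespace (including newlines) to single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def preprocess(records: Iterable[Dict]) -> Iterator[Dict[str, str]]:
    """Yield cleaned records, skipping incomplete ones and duplicates."""
    seen = set()
    for record in records:
        cleaned = {
            field: normalize(str(record.get(field) or ""))
            for field in REQUIRED_FIELDS
        }
        # Drop records where any required field is missing or empty.
        if not all(cleaned.values()):
            continue
        # Deduplicate on the lowercased (prompt, response) pair.
        key = tuple(cleaned[field].lower() for field in REQUIRED_FIELDS)
        if key in seen:
            continue
        seen.add(key)
        yield cleaned


# Made-up input records for illustration only.
raw = [
    {"prompt": "  What is 2 + 2? ", "response": "4"},
    {"prompt": "what is 2 + 2?", "response": "4"},         # duplicate
    {"prompt": "Explain recursion", "response": ""},       # incomplete
    {"prompt": "Fix the off-by-one\nerror", "response": "Use range(n)"},
]

clean = list(preprocess(raw))
for row in clean:
    print(json.dumps(row))  # one JSON object per line
```

Writing the generator this way keeps memory use flat apart from the deduplication set, which may matter for inputs on the scale of the primary dataset.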
Quality Validation
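Before submitting data, a basic per-record check can catch obvious problems. The sketch below is illustrative and is not one of the validation tools shipped with ATLAS: the required fields (`prompt`, `response`) and the length limit are hypothetical, and length is approximated by whitespace-separated words rather than model tokens.

```python
from typing import Dict, Iterable, List

# Hypothetical requirements; adjust to the real schema and limits.
REQUIRED_FIELDS = ("prompt", "response")
MAX_WORDS = 2048


def validate_record(record: Dict) -> List[str]:
    """Return a list of human-readable issues; empty means the record passed."""
    issues = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            issues.append(f"missing or empty field: {field}")
    response = record.get("response")
    # Word count is a rough proxy for token count.
    if isinstance(response, str) and len(response.split()) > MAX_WORDS:
        issues.append("response exceeds length limit")
    return issues


def validation_report(records: Iterable[Dict]) -> Dict[str, int]:
    """Summarize how many records pass or fail validation."""
    report = {"total": 0, "valid": 0, "invalid": 0}
    for record in records:
        report["total"] += 1
        report["invalid" if validate_record(record) else "valid"] += 1
    return report


# Made-up records for illustration only.
batch = [
    {"prompt": "Sum the list [1, 2, 3]", "response": "6"},
    {"prompt": "", "response": "orphaned answer"},
    {"prompt": "Too long", "response": "word " * (MAX_WORDS + 1)},
]

print(validation_report(batch))
```

Returning a list of issues instead of a boolean makes it easier to report every problem with a record at once when preparing a contribution.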
Contributing Data
We welcome contributions to improve ATLAS datasets:
- Format your data according to the schema
- Validate quality using provided tools
- Test with models to ensure compatibility
- Submit a PR with your data and documentation