
Available Datasets

ATLAS provides curated datasets for training adaptive teachers and evaluating system performance.

Primary Dataset

Arc-ATLAS-Teach-v0

View on Hugging Face

Comprehensive teaching interaction dataset for RL training
Purpose: Train teacher models to provide adaptive guidance across diverse tasks.

Statistics:
  • Total examples: 100,000+ teaching interactions
  • Task domains: Mathematics, reasoning, coding, debugging
  • Formats: SFT and RL training splits
  • Languages: English
Data Schema:
{
  "prompt": "The problem or task requiring solution",
  "ground_truth": "Correct answer or solution",
  "student_response": "Initial student attempt",
  "teaching": "Adaptive guidance provided",
  "enhanced_response": "Student response after teaching",
  "baseline_score": 0.3,
  "with_teaching_score": 0.9,
  "reward": 0.6,
  "problem_id": "unique_identifier",
  "student_level": "weak|moderate|strong",
  "domain": "math|reasoning|code|debug"
}
Loading the Dataset:
from datasets import load_dataset

# Load for supervised fine-tuning
sft_data = load_dataset(
    "Arc-Intelligence/Arc-ATLAS-Teach-v0",
    "sft",
    split="train"
)

# Load for reinforcement learning
rl_data = load_dataset(
    "Arc-Intelligence/Arc-ATLAS-Teach-v0",
    "rl",
    split="train"
)

# Load validation set
val_data = load_dataset(
    "Arc-Intelligence/Arc-ATLAS-Teach-v0",
    "rl",
    split="validation"
)
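After loading, a quick per-record sanity check can catch schema drift early. The record below is a hypothetical stand-in for one loaded row, and treating reward as the score delta (with_teaching_score minus baseline_score) is an assumption suggested by the example values in the schema above, not a documented guarantee:

```python
# Sketch: sanity-check one record against the schema shown earlier.
# ASSUMPTION: reward equals with_teaching_score - baseline_score,
# as the schema's example values (0.9 - 0.3 = 0.6) suggest.
record = {
    "prompt": "Sarah has 24 apples. She gives 1/3 to her brother...",
    "ground_truth": "12",
    "baseline_score": 0.3,
    "with_teaching_score": 0.9,
    "reward": 0.6,
    "student_level": "weak",
    "domain": "math",
}

delta = record["with_teaching_score"] - record["baseline_score"]
assert abs(delta - record["reward"]) < 1e-9
print(f"teaching lift: {delta:+.2f}")  # teaching lift: +0.60
```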
File Structure:
Arc-ATLAS-Teach-v0/
├── training/
│   ├── sft.jsonl         # Supervised fine-tuning data
│   └── rl.jsonl          # Reinforcement learning data
└── validation/
    └── rl.jsonl          # Held-out validation
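Each of these files stores one JSON object per line, so they can also be streamed with the standard library alone. A minimal sketch, with an in-memory sample standing in for one of the files above:

```python
import io
import json

# Stand-in for one of the JSONL files in the layout above:
# one JSON object per line.
sample = io.StringIO(
    '{"prompt": "p1", "ground_truth": "g1"}\n'
    '{"prompt": "p2", "ground_truth": "g2"}\n'
)

# Parse line by line, skipping any blank lines.
records = [json.loads(line) for line in sample if line.strip()]
print(len(records))  # 2
```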

Domain-Specific Subsets

Mathematics Subset

Focus: Step-by-step mathematical reasoning

Example:
{
  "prompt": "Sarah has 24 apples. She gives 1/3 to her brother...",
  "ground_truth": "12",
  "teaching": "Break down: 1) Calculate 1/3 of 24 = 8..."
}
Filtering:
math_data = dataset.filter(lambda x: x['domain'] == 'math')

Code Generation Subset

Focus: Programming tasks and debugging

Example:
{
  "prompt": "Write a function to validate email addresses",
  "ground_truth": "def validate_email(email):...",
  "teaching": "Consider regex pattern, edge cases like..."
}
Filtering:
code_data = dataset.filter(lambda x: x['domain'] == 'code')

SRE/Debugging Subset

Focus: System reliability and debugging scenarios

Example:
{
  "prompt": "Service returns 503 errors intermittently",
  "ground_truth": "Check service mesh configuration...",
  "teaching": "Systematic approach: 1) Check Istio configs..."
}
Filtering:
sre_data = dataset.filter(lambda x: x['domain'] == 'debug')
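Before training on a subset, it helps to check how many examples each filter keeps. A minimal sketch over toy records (the real splits expose the same domain and student_level fields, and plain list comprehensions mirror dataset.filter semantics):

```python
from collections import Counter

# Toy records standing in for loaded dataset rows; on the real
# dataset you would iterate the loaded split instead.
records = [
    {"domain": "math", "student_level": "weak"},
    {"domain": "math", "student_level": "strong"},
    {"domain": "code", "student_level": "weak"},
    {"domain": "debug", "student_level": "moderate"},
]

print(Counter(r["domain"] for r in records))
# Counter({'math': 2, 'code': 1, 'debug': 1})

# Filters compose the same way dataset.filter predicates do:
weak_math = [
    r for r in records
    if r["domain"] == "math" and r["student_level"] == "weak"
]
print(len(weak_math))  # 1
```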

Data Quality Metrics

Coverage Statistics

| Domain          | Examples | Avg Length | Unique Patterns |
|-----------------|----------|------------|-----------------|
| Mathematics     | 35,000   | 250 tokens | 500+            |
| Code Generation | 30,000   | 400 tokens | 800+            |
| Reasoning       | 25,000   | 300 tokens | 600+            |
| Debugging       | 10,000   | 350 tokens | 400+            |

Performance Baselines

| Metric           | Baseline | w/ Dual-Agent Loop | Improvement |
|------------------|----------|--------------------|-------------|
| Accuracy         | 62.3%    | 78.0%              | +15.7%      |
| Completion       | 69%      | 100%               | +31%        |
| Token Efficiency | 100%     | 50%                | -50%        |
These figures reflect the closed-loop runtime plus GRPO baseline. Online continual learning now lives in the atlas-sdk runtime if you need task-specific adaptation between offline training runs.

Creating Custom Datasets

Data Format Requirements

Your dataset should follow this structure:
{
    "prompt": str,           # Required: Task description
    "ground_truth": str,     # Required: Correct solution
    "metadata": {            # Optional: Additional context
        "difficulty": str,
        "source": str,
        "tags": List[str]
    }
}
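A minimal sketch of serializing records in this format to JSONL, matching the one-object-per-line layout the loaders expect (the sample record and output path are illustrative):

```python
import json
import os
import tempfile

# Sketch: write custom records in the required format to a JSONL
# file, one JSON object per line. Field names follow the structure
# above; the record contents and output path are illustrative.
records = [
    {
        "prompt": "Write a function to validate email addresses",
        "ground_truth": "def validate_email(email): ...",
        "metadata": {"difficulty": "easy", "source": "internal", "tags": ["code"]},
    },
]

out_path = os.path.join(tempfile.gettempdir(), "custom_dataset.jsonl")
with open(out_path, "w", encoding="utf-8") as f:
    for record in records:
        # Both required keys must be present and non-empty.
        assert record.get("prompt") and record.get("ground_truth")
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```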

Preprocessing Pipeline (JSONL exports)

Use the runtime helpers that ship in this repository to turn SDK exports into trainer-ready splits:
from transformers import AutoTokenizer
from custom_data.runtime_trace_data import get_runtime_trace_dataset

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
splits = get_runtime_trace_dataset(
    tokenizer=tokenizer,
    export_path="traces/runtime.jsonl",  # Generated via `arc-atlas export …`
    eval_split_ratio=0.1,
    dataset_max_samples=5000,
)

train_ds = splits["train_dataset"]
eval_ds = splits["eval_dataset"]
For Postgres-backed workflows, query the SDK database directly and convert records with trainers.runtime_dataset:
from atlas.training_data import get_training_sessions
from trainers.runtime_dataset import sessions_to_rl_records

sessions = get_training_sessions(
    db_url="postgresql://atlas:atlas@localhost:5433/atlas",
    min_reward=0.8,
    learning_key="security-review",
)

records = sessions_to_rl_records(sessions)
records is a list of dictionaries that any Hugging Face Dataset can ingest (the same structure Hydra configs consume via custom_data.runtime_trace_data). See the Training Data Pipeline guide for additional filters and batching helpers.

Quality Validation

Inspect coverage with standard Python tooling; the datasets library is already installed for training:
from collections import Counter
from datasets import Dataset

dataset = Dataset.from_list(records)
lengths = [len(example["step_trace"].split()) for example in dataset]
domains = Counter(example["session_metadata"].get("domain", "unknown") for example in dataset)

print(f"Examples: {len(dataset)}")
print(f"Avg step length: {sum(lengths)/len(lengths):.1f} words")
print(f"Domains: {domains}")
Pair these quick checks with any in-house validators your team already maintains. The key is to keep the format (prompt, student_response, guidance, rewards) identical to what the SDK emits so Atlas Core can reuse the traces without custom glue code.
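A minimal validator sketch in that spirit, checking the required fields from the custom-dataset format on this page (the [0, 1] score-range check is an assumption, not a documented constraint):

```python
REQUIRED_FIELDS = ("prompt", "ground_truth")

def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty = valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # ASSUMPTION: scores, when present, live in [0, 1].
    for field in ("baseline_score", "with_teaching_score"):
        value = record.get(field)
        if value is not None and not 0.0 <= value <= 1.0:
            problems.append(f"{field} out of range: {value}")
    return problems

good = {"prompt": "Sum 2+2", "ground_truth": "4", "baseline_score": 0.3}
bad = {"prompt": "", "baseline_score": 1.5}
print(validate_record(good))  # []
print(validate_record(bad))
```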

Contributing Data

We welcome contributions to improve ATLAS datasets:
  1. Format your data according to the schema
  2. Validate quality using provided tools
  3. Test with models to ensure compatibility
  4. Submit PR with data and documentation
See Contributing Guidelines for details.

License and Citation

Datasets are released under Apache 2.0 license. If you use these datasets, please cite:
@dataset{atlas_teach_v0,
  title={Arc-ATLAS-Teach-v0: Adaptive Teaching Dataset},
  author={Arc Intelligence Team},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/Arc-Intelligence/Arc-ATLAS-Teach-v0}
}

Next Steps