Data Format Requirements
Your dataset should follow the structure described below.

Preprocessing Pipeline (JSONL exports)
Use the runtime helpers that ship in this repository to turn SDK exports into trainer-ready splits.

Postgres-Backed Workflows
For Postgres-backed workflows, query the SDK database directly and convert records with atlas_core.data.runtime_traces:
records is a list of dictionaries that any Hugging Face Dataset can ingest (the same structure Hydra configs consume via atlas_core.data.runtime_traces). See the Training Data Pipeline guide for additional filters and batching helpers.
GKD alignment note: Every conversation record now carries prompt_text (the serialized messages excluding the final assistant turn) and completion_text (the assistant response the student learns to mimic). These fields let the distillation pipeline re-render prompts with both the student and teacher tokenizers, so cross-tokenizer KL is computed in each model's native chat template.

Quality Validation
Inspect coverage with standard Python tooling; you already have datasets installed for training:
Keep the field names (prompt, student_response, guidance, rewards) identical to what the SDK emits so Atlas Core can reuse the traces without custom glue code.
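As a sketch of how the GKD fields above relate, prompt_text and completion_text can be derived from a messages list like this (the JSON serialization is illustrative only; the distillation pipeline re-renders prompt_text with each tokenizer's own chat template):

```python
import json

def split_conversation(messages):
    """Split a chat into prompt_text (everything before the final
    assistant turn) and completion_text (the final assistant response)."""
    assert messages and messages[-1]["role"] == "assistant"
    # Plain JSON serialization for illustration; in the real pipeline each
    # tokenizer applies its native chat template to these messages.
    prompt_text = json.dumps(messages[:-1])
    completion_text = messages[-1]["content"]
    return prompt_text, completion_text

messages = [
    {"role": "user", "content": "Summarize the trace."},
    {"role": "assistant", "content": "The student solved it in two steps."},
]
prompt_text, completion_text = split_conversation(messages)
print(completion_text)
```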
Next Steps
- Datasets Reference: official datasets and schemas
- GKD Training: train with custom datasets
- GRPO Training: RL training with custom data
- Export Traces: export runtime traces from the SDK