Data Format Requirements
Your dataset should follow this structure:
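For example, a single record might look like this (shown pretty-printed here, though JSONL keeps each record on one line; the field values and the shape of `rewards` are illustrative, and `prompt_text`/`completion_text` are covered under the GKD note below):

```json
{
  "prompt": "Summarize the quarterly report.",
  "student_response": "Revenue grew 12% quarter over quarter...",
  "guidance": "Lead with the headline numbers.",
  "rewards": {"score": 0.82},
  "prompt_text": "[{\"role\": \"user\", \"content\": \"Summarize the quarterly report.\"}]",
  "completion_text": "Revenue grew 12% quarter over quarter..."
}
```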
Preprocessing Pipeline (JSONL exports)

Use the runtime helpers that ship in this repository to turn SDK exports into trainer-ready splits:
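The helper names themselves aren't reproduced here; as a rough sketch, a plain-Python equivalent of the JSONL-to-splits step looks like this (the file path and split ratio are assumptions):

```python
# Minimal sketch of the JSONL-to-splits step; the repository's runtime helpers
# wrap this logic. The export path and split ratio are illustrative.
import json
from datasets import Dataset

with open("exports/runtime_traces.jsonl") as f:  # SDK JSONL export (example path)
    records = [json.loads(line) for line in f]

dataset = Dataset.from_list(records)
splits = dataset.train_test_split(test_size=0.05, seed=42)  # trainer-ready train/eval splits
train_ds, eval_ds = splits["train"], splits["test"]
```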
Postgres-Backed Workflows

Query the SDK database directly and convert records with `trainers.runtime_dataset`:
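A minimal sketch of the query side, assuming `psycopg2` and illustrative table and column names; the conversion entry point inside `trainers.runtime_dataset` is not spelled out here:

```python
# Hypothetical sketch: the connection string, table, and column names are
# assumptions; trainers.runtime_dataset provides the actual converters.
import psycopg2
from psycopg2.extras import RealDictCursor

conn = psycopg2.connect("postgresql://localhost/atlas_sdk")
with conn.cursor(cursor_factory=RealDictCursor) as cur:
    cur.execute("SELECT prompt, student_response, guidance, rewards FROM traces")
    records = [dict(row) for row in cur.fetchall()]  # list[dict], ready for conversion
conn.close()
```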
`records` is a list of dictionaries that any Hugging Face `Dataset` can ingest (the same structure Hydra configs consume via `custom_data.runtime_trace_data`). See the Training Data Pipeline guide for additional filters and batching helpers.
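As a rough illustration, the corresponding Hydra fragment might look like this (only the `custom_data.runtime_trace_data` key comes from the text above; the value and surrounding structure are assumptions):

```yaml
# Hypothetical Hydra config fragment; the path value is illustrative.
custom_data:
  runtime_trace_data: exports/runtime_traces.jsonl
```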
GKD alignment note: Every conversation record now carries `prompt_text` (the serialized messages excluding the final assistant turn) and `completion_text` (the assistant response the student learns to mimic). These fields let the distillation pipeline re-render prompts with both the student and teacher tokenizers, so the cross-tokenizer KL is computed in each model's native chat template.
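A minimal sketch of that re-rendering step, assuming `prompt_text` stores the message list as JSON and using `transformers` chat templates (the model names and the serialization format are assumptions):

```python
# Hypothetical sketch: model names are placeholders and the JSON serialization
# of prompt_text is an assumption; the real distillation pipeline handles this.
import json
from transformers import AutoTokenizer

student_tok = AutoTokenizer.from_pretrained("student-model")  # placeholder
teacher_tok = AutoTokenizer.from_pretrained("teacher-model")  # placeholder

record = {
    "prompt_text": json.dumps([{"role": "user", "content": "Summarize the report."}]),
    "completion_text": "Revenue grew 12% quarter over quarter...",
}

messages = json.loads(record["prompt_text"])  # messages minus the final assistant turn
for tok in (student_tok, teacher_tok):
    # Each tokenizer renders the same conversation in its own chat template,
    # so the KL is computed over each model's native token ids.
    prompt_ids = tok.apply_chat_template(messages, add_generation_prompt=True)
    completion_ids = tok(record["completion_text"], add_special_tokens=False).input_ids
```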
Quality Validation

Inspect coverage with standard Python tooling; you already have `datasets` installed for training:
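For instance (the export path is illustrative):

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="exports/runtime_traces.jsonl", split="train")
print(ds.column_names, len(ds))

# Flag records missing the distillation fields described above.
missing = ds.filter(lambda r: not r.get("prompt_text") or not r.get("completion_text"))
print(f"{len(missing)} records missing prompt_text/completion_text")
```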
Keep the field names (`prompt`, `student_response`, `guidance`, `rewards`) identical to what the SDK emits so Atlas Core can reuse the traces without custom glue code.