Training Configuration

Atlas Core uses Hydra to compose model, dataset, trainer, and reward presets. This page is a complete parameter reference for launching GRPO, GKD, or SFT jobs.

For workflow guides, see GRPO Training or GKD Training. This page focuses on exhaustive parameter lookup.

Training Method Comparison

Method	When to Use	Speed	Data Requirement	Output
GRPO	Train from rewards (RL)	24-48h	Runtime traces with reward signals	RL-optimized teacher
GKD	Distill large \u2192 small teacher	4-8h (9-30× faster than GRPO)	Strong teacher exists	Compressed teacher
SFT	Supervised warmup	2-4h	Approved conversational traces	Baseline teacher

→ Full comparison in Offline Training Guide

Hydra Composition Map

Hydra builds a training run by merging defaults from each config group:

Layer	Example Files	When to Modify
`train.yaml`	`configs/train.yaml`	Global logging/output rules
`model@_global_`	`qwen3_8b.yaml`, `base.yaml`	Swap checkpoints, quantization
`data@_global_`	`runtime_traces.yaml`, `arc_atlas_rl.yaml`	Choose datasets, sampling
`trainer@_global_`	`grpo.yaml`, `base_sft.yaml`, `teacher_grpo.yaml`	Algorithm, optimizer, batching
`run@_global_`	`teacher_rcl.yaml`, `teacher_sft.yaml`	Pre-built experiment bundles
`reward@_global_`	`rim_teaching.yaml`	Reward adapter, prompt templates

Model Presets (`configs/model/`)

Parameter	Default	When to Change
`model_name_or_path`	Required (e.g., `Qwen/Qwen3-8B`)	Every run - specify checkpoint
`tokenizer_name_or_path`	`${model_name_or_path}`	Distinct tokenizer needed
`trust_remote_code`	`true`	Vendor-specific architectures
`use_peft`	`false`	Enable LoRA/PEFT adapters
`load_in_4bit`	`false`	GPU memory constrained
`tokenizer.padding_side`	`left`	Keeps RL rollouts aligned
`unsafe_tokenizer_loading`	`false`	Untrusted tokenizer code
`torch_dtype`	`bfloat16` (qwen3_8b)	Hardware-specific precision
`attn_implementation`	`flash_attention_2`	Faster attention on supported GPUs

Dataset Presets (`configs/data/`)

Parameter	Default	When to Change
`dataset_id_or_path`	`Arc-Intelligence/Arc-ATLAS-Teach-v0`	HuggingFace hub ID or local path
`dataset_split`	`rl`, `train`	Multi-split datasets
`dataset_level_filter`	`null` (e.g., `level_4_5` for BigMath)	Curriculum control
`dataset_max_samples`	`null`	Subsample for quick experiments
`eval_split_ratio`	`0.1`	Define held-out eval share
`shuffle`	`true` (`runtime_traces`)	Randomize JSONL before split
`completion_only_training`	`True` (`arc_atlas_sft`)	Trim prompts in SFT
`dataset_path`	`traces/export.jsonl` (`runtime_traces`)	Point at exported JSONL
`make_dataset_fn._target_`	`custom_data.runtime_trace_data.get_runtime_trace_dataset`	Loader entrypoint

Common Dataset Configs:

runtime_traces.yaml - Exported JSONL from Atlas SDK
arc_atlas_rl.yaml - Pre-collected RL dataset
arc_atlas_sft.yaml - Supervised fine-tuning dataset

Trainer Base Defaults (`configs/trainer/base.yaml`)

Parameter	Default	When to Change
`max_steps`	`450`	Override in run recipe
`num_train_epochs`	`1`	Mutually exclusive with max_steps
`train_batch_size`	`64`	Effective batch across devices
`per_device_train_batch_size`	`2`	Per-rank micro batch
`gradient_accumulation_steps`	Inferred	Auto-computed if omitted
`gradient_checkpointing`	`true`	Memory savings for long contexts
`learning_rate`	`5e-7`	Baseline LR for RL
`weight_decay`	`0`	Regularization needed
`max_grad_norm`	`1.0`	Gradient clipping value
`lr_scheduler_type`	`"cosine"`	Constant/linear schedules
`warmup_ratio`	`0.03`	Warmup fraction of total steps
`bf16` / `tf32`	`true` / `true`	Mixed-precision on supported GPUs
`ddp_timeout`	`18000` seconds	Distributed training timeout

gradient_accumulation_steps is auto-computed: train_batch_size / (per_device_train_batch_size × num_devices). Provide any two values; the launcher resolves the third (see train.py:55-103).

GRPO Algorithm Controls (`configs/trainer/grpo.yaml`)

Parameter	Default	When to Change
`max_steps`	`200`	Override base default
`train_batch_size`	`252`	Must divide evenly by devices
`per_device_train_batch_size`	`3`	Increase for fewer devices
`num_generations`	`null`	Limit for budget control
`learning_rate`	`1e-6`	RL-specific step size
`beta`	`0.04`	KL penalization strength
`max_prompt_length` / `max_completion_length`	`2048` / `16384`	Truncate input/output
`shuffle_generation_inputs`	`true`	Shuffle prompts before generation
`temperature` / `top_p` / `top_k` / `min_p`	`1.0` / `1.0` / `null` / `null`	Sampling controls for rollouts
`repetition_penalty`	`1.0`	Discourage repetition
`use_vllm`	`true`	Fast generation (recommended)
`vllm_device`	`"auto"`	Auto-select devices
`vllm_gpu_memory_utilization`	`0.9`	Cap GPU memory per worker
`vllm_dtype`	`"auto"`	Hardware-based dtype
`vllm_max_model_len`	`null`	Override max context length
`use_ray`	`false`	Remote vLLM via Ray
`ray_tensor_parallelism`	`1`	Split model across GPUs
`enable_prefix_caching`	`false`	Cache prompt prefixes
`enforce_eager`	`true`	PyTorch eager execution (safer debugging)
`use_vllm_server`	`false`	External vLLM server
`vllm_host` / `vllm_port`	`null` / `null`	Required if use_vllm_server=true
`reward_weights`	`null`	Per-judge scaling factors
`sync_ref_model`	`false`	Keep reference model in sync
`ref_model_sync_steps`	`64`	Sync frequency (steps)
`unbias_log_probabilities`	`true`	Correct for temperature scaling
`log_completions`	`false`	Store sampled completions
`push_to_hub`	`false`	Publish to HuggingFace Hub
`activate_debugging_logs`	`false`	Extra diagnostics

Teacher GRPO Overlay (`configs/trainer/teacher_grpo.yaml`)

Extends base GRPO with diagnostic prompts and teacher-specific controls.

Parameter	Default	When to Change
`trainer_log_name`	`teacher_grpo_rw_${reward_log_name}`	Appends reward preset name
`logging_prob`	`0.1`	Fraction of episodes logged
`student_model`	`null`	Co-train student alongside teacher
`use_reference_teacher_model`	`false`	Compare vs static reference
`completion_only_training`	`false`	Completion-only datasets
`trainer_args.max_probe_tokens`	`500`	Diagnostic prompt budget
`trainer_args.student_diagnostic_template`	Multiline	Reflection prompt (see below)
`trainer_args.teacher_adaptive_template`	Multiline	Guidance prompt (see below)
`trainer_args.student_with_teaching_template`	Multiline	Apply feedback prompt (see below)

Default Prompt Templates:

student_diagnostic_template: |
  Question: {question}
  Before solving, briefly describe:
  1. What type of problem this is
  2. The key concepts or steps needed
  3. Any potential challenges you see

teacher_adaptive_template: |
  Question: {question}
  Student's approach: {approach}

  <thinking>
  [Analyze student approach]
  </thinking>

  <teaching>
  [Only guidance to student - no answers]
  </teaching>

student_with_teaching_template: |
  Question: {question}
  A teacher has provided: {teaching}

  Now solve step by step.
  <solution></solution>

SFT Trainer (`configs/trainer/base_sft.yaml`)

Parameter	Default	Run Override (`teacher_sft.yaml`)	When to Change
`num_train_epochs`	`1`	`10`	Longer supervised training
`max_steps`	`-1`	`-1`	Negative disables step cap
`train_batch_size`	`64`	`16`	Effective batch for SFT
`per_device_train_batch_size`	`4`	`1`	Pair with gradient accum
`learning_rate`	Base: `5e-7`, SFT: `2e-4`	`2e-4`	Higher LR for SFT
`lr_scheduler_type`	`"cosine"`	`"constant"`	SFT prefers constant
`warmup_ratio`	`0.03`	`0.1`	Warm start for supervised
`max_seq_length`	`4096`	`16384`	Match runtime telemetry
`packing`	`true`	`false`	Disable for long contexts
`do_eval`	`true`	`false`	Enable with validation splits
`ddp_timeout`	`18000`	`180000000`	Long sequences across ranks

Run Recipes (`configs/run/`)

Pre-built experiment bundles that override multiple config groups.

Recipe	Key Overrides	Use Case
`default.yaml`	Empty (inherit globals)	CLI override experiments
`teacher_rcl.yaml`	Model: `qwen3_8b`, Data: `arc_atlas_rl`, Trainer: `teacher_grpo`, Batch: 128, vLLM server enabled	Production GRPO for reward-conditioned learning
`teacher_sft.yaml`	Trainer: `base_sft`, Data: `arc_atlas_sft`, Epochs: 10, Max seq: 16384	Supervised warmup before GRPO

Reward Preset (`configs/trainer/reward/rim_teaching.yaml`)

Parameter	Default	When to Change
`reward_log_name`	`rim_teaching`	Propagates to trainer logs
`max_probe_tokens`	`500`	Diagnostic prompt length
`teacher_reward._target_`	`RIM.reward_adapter.RIMReward`	Uses `rim_offline_config.yaml`
`student_model` / `teacher_model` / `tokenizer`	`null`	Auto-populated from Hydra model

Reward Configs (Runtime vs Offline)

Setting	Runtime (`rim_config.yaml`)	Offline (`rim_offline_config.yaml`)
`temperatures`	`[0.2, 0.5, 0.8]`	`[0.2, 0.5, 0.8]`
`models.small_model`	`"gemini/gemini-2.5-flash"`	`"gemini/gemini-2.5-flash"`
`models.large_model`	`"gemini/gemini-2.5-pro"`	`"gemini/gemini-2.5-pro"`
`active_judges`	accuracy, helpfulness, process, diagnostic (all true)	helpfulness, process (accuracy/diagnostic false)
`anti_gaming.cap_score`	`0.3`	`0.3`
`parallel_execution.max_workers`	`8`	`8`

Next Steps

GRPO Training

Launch reward-conditioned learning runs

GKD Training

Distill large teachers into small models

Reward System

Judge design, weights, escalation

Getting Started

SDK Guides

Examples

Training

Core Concepts

Reference

Benchmarks

Training Configuration

Training Method Comparison

Hydra Composition Map

Model Presets (`configs/model/`)

Dataset Presets (`configs/data/`)

Trainer Base Defaults (`configs/trainer/base.yaml`)

GRPO Algorithm Controls (`configs/trainer/grpo.yaml`)

Teacher GRPO Overlay (`configs/trainer/teacher_grpo.yaml`)

SFT Trainer (`configs/trainer/base_sft.yaml`)

Run Recipes (`configs/run/`)

Reward Preset (`configs/trainer/reward/rim_teaching.yaml`)

Reward Configs (Runtime vs Offline)

Next Steps

GRPO Training

GKD Training

Reward System

Getting Started

SDK Guides

Examples

Training

Core Concepts

Reference

Benchmarks

​Training Method Comparison

​Hydra Composition Map

​Model Presets (configs/model/)

​Dataset Presets (configs/data/)

​Trainer Base Defaults (configs/trainer/base.yaml)

​GRPO Algorithm Controls (configs/trainer/grpo.yaml)

​Teacher GRPO Overlay (configs/trainer/teacher_grpo.yaml)

​SFT Trainer (configs/trainer/base_sft.yaml)

​Run Recipes (configs/run/)

​Reward Preset (configs/trainer/reward/rim_teaching.yaml)

​Reward Configs (Runtime vs Offline)

​Next Steps

GRPO Training

GKD Training

Reward System

Training Method Comparison

Hydra Composition Map

Model Presets (`configs/model/`)

Dataset Presets (`configs/data/`)

Trainer Base Defaults (`configs/trainer/base.yaml`)

GRPO Algorithm Controls (`configs/trainer/grpo.yaml`)

Teacher GRPO Overlay (`configs/trainer/teacher_grpo.yaml`)

SFT Trainer (`configs/trainer/base_sft.yaml`)

Run Recipes (`configs/run/`)

Reward Preset (`configs/trainer/reward/rim_teaching.yaml`)

Reward Configs (Runtime vs Offline)

Next Steps