Atlas Core uses Hydra to compose model, dataset, trainer, and reward presets. This page is a complete parameter reference for launching GRPO, GKD, or SFT jobs.
For workflow guides, see GRPO Training or GKD Training. This page focuses on exhaustive parameter lookup.

Training Method Comparison

| Method | When to Use | Speed | Data Requirement | Output |
| --- | --- | --- | --- | --- |
| GRPO | Train from rewards (RL) | 24-48h | Runtime traces with reward signals | RL-optimized teacher |
| GKD | Distill large → small teacher | 4-8h (9-30× faster than GRPO) | Strong teacher exists | Compressed teacher |
| SFT | Supervised warmup | 2-4h | Approved conversational traces | Baseline teacher |
→ Full comparison in Offline Training Guide

Hydra Composition Map

Hydra builds a training run by merging defaults from each config group:
| Layer | Example Files | When to Modify |
| --- | --- | --- |
| train.yaml | configs/train.yaml | Global logging/output rules |
| model@_global_ | qwen3_8b.yaml, base.yaml | Swap checkpoints, quantization |
| data@_global_ | runtime_traces.yaml, arc_atlas_rl.yaml | Choose datasets, sampling |
| trainer@_global_ | grpo.yaml, base_sft.yaml, teacher_grpo.yaml | Algorithm, optimizer, batching |
| run@_global_ | teacher_rcl.yaml, teacher_sft.yaml | Pre-built experiment bundles |
| reward@_global_ | rim_teaching.yaml | Reward adapter, prompt templates |
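
As a rough sketch of how these layers might be selected from the command line, the helper below builds Hydra-style override strings. It assumes the groups are overridden with the `group@_global_=option` form that Hydra uses for package-qualified config groups. The function name `build_overrides` is hypothetical and not part of Atlas Core, and the launch script these arguments would be passed to is deliberately left out.

```python
def build_overrides(groups, params=None):
    """Build Hydra command-line override strings (illustrative helper).

    groups: mapping of config group -> preset name without ".yaml",
            e.g. {"model": "qwen3_8b"}; each is placed in the _global_ package.
    params: mapping of parameter name -> value, e.g. {"max_steps": 50}.
    """
    overrides = [f"{group}@_global_={preset}" for group, preset in groups.items()]
    for key, value in (params or {}).items():
        # Hydra expects lowercase true/false/null, so normalise Python literals.
        if value is None:
            rendered = "null"
        elif isinstance(value, bool):
            rendered = str(value).lower()
        else:
            rendered = str(value)
        overrides.append(f"{key}={rendered}")
    return overrides


overrides = build_overrides(
    {"model": "qwen3_8b", "data": "runtime_traces", "trainer": "teacher_grpo"},
    {"max_steps": 50, "use_vllm": False},
)
print(" ".join(overrides))
```

Keeping the group selections separate from the scalar parameters mirrors the table: the first argument picks a file per layer, the second tweaks individual values inside the merged config.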

Model Presets (configs/model/)

| Parameter | Default | When to Change |
| --- | --- | --- |
| model_name_or_path | Required (e.g., Qwen/Qwen3-8B) | Every run - specify checkpoint |
| tokenizer_name_or_path | ${model_name_or_path} | Distinct tokenizer needed |
| trust_remote_code | true | Vendor-specific architectures |
| use_peft | false | Enable LoRA/PEFT adapters |
| load_in_4bit | false | GPU memory constrained |
| tokenizer.padding_side | left | Keeps RL rollouts aligned |
| unsafe_tokenizer_loading | false | Untrusted tokenizer code |
| torch_dtype | bfloat16 (qwen3_8b) | Hardware-specific precision |
| attn_implementation | flash_attention_2 | Faster attention on supported GPUs |

Dataset Presets (configs/data/)

| Parameter | Default | When to Change |
| --- | --- | --- |
| dataset_id_or_path | Arc-Intelligence/Arc-ATLAS-Teach-v0 | HuggingFace hub ID or local path |
| dataset_split | rl, train | Multi-split datasets |
| dataset_level_filter | null (e.g., level_4_5 for BigMath) | Curriculum control |
| dataset_max_samples | null | Subsample for quick experiments |
| eval_split_ratio | 0.1 | Define held-out eval share |
| shuffle | true (runtime_traces) | Randomize JSONL before split |
| completion_only_training | True (arc_atlas_sft) | Trim prompts in SFT |
| dataset_path | traces/export.jsonl (runtime_traces) | Point at exported JSONL |
| make_dataset_fn._target_ | custom_data.runtime_trace_data.get_runtime_trace_dataset | Loader entrypoint |

Common Dataset Configs:
  • runtime_traces.yaml - Exported JSONL from Atlas SDK
  • arc_atlas_rl.yaml - Pre-collected RL dataset
  • arc_atlas_sft.yaml - Supervised fine-tuning dataset
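
To make the interaction between `shuffle` and `eval_split_ratio` concrete, here is a minimal sketch of shuffling JSONL records and carving off a held-out share. It is not the `get_runtime_trace_dataset` loader; `split_records`, the `seed` argument, and the choice to take the eval share from the end of the list are all assumptions made for illustration.

```python
import json
import random


def split_records(records, eval_split_ratio=0.1, shuffle=True, seed=0):
    """Return (train, held_out) lists; the held-out share comes from the end."""
    records = list(records)
    if shuffle:
        # Shuffle before splitting so the held-out set is not just the newest traces.
        random.Random(seed).shuffle(records)
    n_eval = int(len(records) * eval_split_ratio)
    cut = len(records) - n_eval
    return records[:cut], records[cut:]


# Stand-in for lines read from an exported JSONL file.
lines = [json.dumps({"id": i}) for i in range(20)]
records = [json.loads(line) for line in lines]

train, held_out = split_records(records, eval_split_ratio=0.1)
print(len(train), len(held_out))
```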

Trainer Base Defaults (configs/trainer/base.yaml)

| Parameter | Default | When to Change |
| --- | --- | --- |
| max_steps | 450 | Override in run recipe |
| num_train_epochs | 1 | Mutually exclusive with max_steps |
| train_batch_size | 64 | Effective batch across devices |
| per_device_train_batch_size | 2 | Per-rank micro batch |
| gradient_accumulation_steps | Inferred | Auto-computed if omitted |
| gradient_checkpointing | true | Memory savings for long contexts |
| learning_rate | 5e-7 | Baseline LR for RL |
| weight_decay | 0 | Regularization needed |
| max_grad_norm | 1.0 | Gradient clipping value |
| lr_scheduler_type | "cosine" | Constant/linear schedules |
| warmup_ratio | 0.03 | Warmup fraction of total steps |
| bf16 / tf32 | true / true | Mixed-precision on supported GPUs |
| ddp_timeout | 18000 seconds | Distributed training timeout |

gradient_accumulation_steps is auto-computed: train_batch_size / (per_device_train_batch_size × num_devices). Provide any two values; the launcher resolves the third (see train.py:55-103).
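
The formula above can be sketched as a small function. This is an illustration of the arithmetic only, not the logic in train.py; the function name and the choice to raise an error on a non-integer result are assumptions, and the device count of 4 is just an example.

```python
def resolve_gradient_accumulation(train_batch_size, per_device_train_batch_size, num_devices):
    """Compute train_batch_size / (per_device_train_batch_size * num_devices)."""
    samples_per_step = per_device_train_batch_size * num_devices
    if train_batch_size % samples_per_step != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not a multiple of "
            f"per_device_train_batch_size * num_devices = {samples_per_step}"
        )
    return train_batch_size // samples_per_step


# Base defaults (64 total, 2 per device) on an assumed 4-GPU node.
print(resolve_gradient_accumulation(64, 2, 4))   # 8
# GRPO defaults (252 total, 3 per device) on the same assumed node.
print(resolve_gradient_accumulation(252, 3, 4))  # 21
```

Checking divisibility up front is one way to surface a mismatched device count before a long job starts.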

GRPO Algorithm Controls (configs/trainer/grpo.yaml)

| Parameter | Default | When to Change |
| --- | --- | --- |
| max_steps | 200 | Override base default |
| train_batch_size | 252 | Must divide evenly by devices |
| per_device_train_batch_size | 3 | Increase for fewer devices |
| num_generations | null | Limit for budget control |
| learning_rate | 1e-6 | RL-specific step size |
| beta | 0.04 | KL penalization strength |
| max_prompt_length / max_completion_length | 2048 / 16384 | Truncate input/output |
| shuffle_generation_inputs | true | Shuffle prompts before generation |
| temperature / top_p / top_k / min_p | 1.0 / 1.0 / null / null | Sampling controls for rollouts |
| repetition_penalty | 1.0 | Discourage repetition |
| use_vllm | true | Fast generation (recommended) |
| vllm_device | "auto" | Auto-select devices |
| vllm_gpu_memory_utilization | 0.9 | Cap GPU memory per worker |
| vllm_dtype | "auto" | Hardware-based dtype |
| vllm_max_model_len | null | Override max context length |
| use_ray | false | Remote vLLM via Ray |
| ray_tensor_parallelism | 1 | Split model across GPUs |
| enable_prefix_caching | false | Cache prompt prefixes |
| enforce_eager | true | PyTorch eager execution (safer debugging) |
| use_vllm_server | false | External vLLM server |
| vllm_host / vllm_port | null / null | Required if use_vllm_server=true |
| reward_weights | null | Per-judge scaling factors |
| sync_ref_model | false | Keep reference model in sync |
| ref_model_sync_steps | 64 | Sync frequency (steps) |
| unbias_log_probabilities | true | Correct for temperature scaling |
| log_completions | false | Store sampled completions |
| push_to_hub | false | Publish to HuggingFace Hub |
| activate_debugging_logs | false | Extra diagnostics |
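
Some of these settings depend on each other, most visibly `vllm_host` / `vllm_port`, which are required once `use_vllm_server=true`. The pre-flight check below is a hedged sketch of how such constraints could be verified on a plain dictionary before launching. `check_vllm_settings` is a hypothetical helper, not an Atlas Core function, and the range check on `vllm_gpu_memory_utilization` assumes the value is a fraction of GPU memory.

```python
def check_vllm_settings(cfg):
    """Return a list of human-readable problems found in a GRPO config dict."""
    problems = []
    if cfg.get("use_vllm_server"):
        # Both connection fields default to null and must be filled in together.
        for key in ("vllm_host", "vllm_port"):
            if cfg.get(key) is None:
                problems.append(f"{key} must be set when use_vllm_server=true")
    utilization = cfg.get("vllm_gpu_memory_utilization", 0.9)
    if not 0.0 < utilization <= 1.0:
        problems.append("vllm_gpu_memory_utilization must be in (0, 1]")
    return problems


grpo_defaults = {
    "use_vllm": True,
    "use_vllm_server": False,
    "vllm_host": None,
    "vllm_port": None,
    "vllm_gpu_memory_utilization": 0.9,
}
print(check_vllm_settings(grpo_defaults))

server_run = {**grpo_defaults, "use_vllm_server": True}
for problem in check_vllm_settings(server_run):
    print(problem)
```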

Teacher GRPO Overlay (configs/trainer/teacher_grpo.yaml)

Extends base GRPO with diagnostic prompts and teacher-specific controls.
| Parameter | Default | When to Change |
| --- | --- | --- |
| trainer_log_name | teacher_grpo_rw_${reward_log_name} | Appends reward preset name |
| logging_prob | 0.1 | Fraction of episodes logged |
| student_model | null | Co-train student alongside teacher |
| use_reference_teacher_model | false | Compare vs static reference |
| completion_only_training | false | Completion-only datasets |
| trainer_args.max_probe_tokens | 500 | Diagnostic prompt budget |
| trainer_args.student_diagnostic_template | Multiline | Reflection prompt (see below) |
| trainer_args.teacher_adaptive_template | Multiline | Guidance prompt (see below) |
| trainer_args.student_with_teaching_template | Multiline | Apply feedback prompt (see below) |

Default Prompt Templates:
student_diagnostic_template: |
  Question: {question}
  Before solving, briefly describe:
  1. What type of problem this is
  2. The key concepts or steps needed
  3. Any potential challenges you see

teacher_adaptive_template: |
  Question: {question}
  Student's approach: {approach}

  <thinking>
  [Analyze student approach]
  </thinking>

  <teaching>
  [Only guidance to student - no answers]
  </teaching>

student_with_teaching_template: |
  Question: {question}
  A teacher has provided: {teaching}

  Now solve step by step.
  <solution></solution>
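
The placeholders in these templates (`{question}`, `{approach}`, `{teaching}`) look like Python format fields. Assuming they are filled with `str.format`-style substitution (the trainer may use a different mechanism), a custom template can be checked locally like this. The example question and teaching text are made up.

```python
# Copy of the default student_with_teaching_template shown above.
student_with_teaching_template = """\
Question: {question}
A teacher has provided: {teaching}

Now solve step by step.
<solution></solution>
"""

prompt = student_with_teaching_template.format(
    question="What is 12 * 7?",
    teaching="Break 12 into 10 + 2 and multiply each part.",
)
print(prompt)
```

Rendering a template once with dummy values is a quick way to catch a misspelled placeholder, which would raise `KeyError` here, before it reaches a training run.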

SFT Trainer (configs/trainer/base_sft.yaml)

| Parameter | Default | Run Override (teacher_sft.yaml) | When to Change |
| --- | --- | --- | --- |
| num_train_epochs | 1 | 10 | Longer supervised training |
| max_steps | -1 | -1 | Negative disables step cap |
| train_batch_size | 64 | 16 | Effective batch for SFT |
| per_device_train_batch_size | 4 | 1 | Pair with gradient accum |
| learning_rate | Base: 5e-7, SFT: 2e-4 | 2e-4 | Higher LR for SFT |
| lr_scheduler_type | "cosine" | "constant" | SFT prefers constant |
| warmup_ratio | 0.03 | 0.1 | Warm start for supervised |
| max_seq_length | 4096 | 16384 | Match runtime telemetry |
| packing | true | false | Disable for long contexts |
| do_eval | true | false | Enable with validation splits |
| ddp_timeout | 18000 | 180000000 | Long sequences across ranks |
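
The Default and Run Override columns can be read as a later config layer winning over an earlier one. The sketch below imitates that with a flat dictionary merge. Hydra performs a deeper, OmegaConf-based merge, so treat this as a simplified stand-in; `apply_overrides` is a hypothetical helper, and only a subset of the parameters is included.

```python
# Values copied from the "Default" and "Run Override" columns of the SFT table.
base_sft = {
    "num_train_epochs": 1,
    "max_steps": -1,
    "train_batch_size": 64,
    "per_device_train_batch_size": 4,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.03,
    "max_seq_length": 4096,
    "packing": True,
}
teacher_sft = {
    "num_train_epochs": 10,
    "train_batch_size": 16,
    "per_device_train_batch_size": 1,
    "lr_scheduler_type": "constant",
    "warmup_ratio": 0.1,
    "max_seq_length": 16384,
    "packing": False,
}


def apply_overrides(defaults, overrides):
    """Later layer wins; keys absent from overrides keep their default."""
    return {**defaults, **overrides}


resolved = apply_overrides(base_sft, teacher_sft)
changed = {k: (base_sft[k], v) for k, v in resolved.items() if base_sft[k] != v}
for key, (old, new) in changed.items():
    print(f"{key}: {old} -> {new}")
```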

Run Recipes (configs/run/)

Pre-built experiment bundles that override multiple config groups.
| Recipe | Key Overrides | Use Case |
| --- | --- | --- |
| default.yaml | Empty (inherit globals) | CLI override experiments |
| teacher_rcl.yaml | Model: qwen3_8b, Data: arc_atlas_rl, Trainer: teacher_grpo, Batch: 128, vLLM server enabled | Production GRPO for reward-conditioned learning |
| teacher_sft.yaml | Trainer: base_sft, Data: arc_atlas_sft, Epochs: 10, Max seq: 16384 | Supervised warmup before GRPO |

Reward Preset (configs/trainer/reward/rim_teaching.yaml)

| Parameter | Default | When to Change |
| --- | --- | --- |
| reward_log_name | rim_teaching | Propagates to trainer logs |
| max_probe_tokens | 500 | Diagnostic prompt length |
| teacher_reward._target_ | RIM.reward_adapter.RIMReward | Uses rim_offline_config.yaml |
| student_model / teacher_model / tokenizer | null | Auto-populated from Hydra model |

Reward Configs (Runtime vs Offline)

| Setting | Runtime (rim_config.yaml) | Offline (rim_offline_config.yaml) |
| --- | --- | --- |
| temperatures | [0.2, 0.5, 0.8] | [0.2, 0.5, 0.8] |
| models.small_model | "gemini/gemini-2.5-flash" | "gemini/gemini-2.5-flash" |
| models.large_model | "gemini/gemini-2.5-pro" | "gemini/gemini-2.5-pro" |
| active_judges | accuracy, helpfulness, process, diagnostic (all true) | helpfulness, process (accuracy/diagnostic false) |
| anti_gaming.cap_score | 0.3 | 0.3 |
| parallel_execution.max_workers | 8 | 8 |

Next Steps