Overview

ATLAS configurations control every aspect of training and inference. Parameters are organized into logical groups for easier navigation.

[Diagram: ATLAS Architecture]

Typical Usage

from trainers.grpo_config import GRPOConfig
from trainers.grpo import GRPOTrainer

config = GRPOConfig(
    model_name_or_path="Arc-Intelligence/ATLAS-8B-Thinking",
    learning_rate=5e-6,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    beta=0.04,  # KL penalty
    temperature=0.7
)

trainer = GRPOTrainer(config)
trainer.train()

GRPOConfig Parameters

Core Training Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model_name_or_path | str | required | HuggingFace model or local path |
| learning_rate | float | 1e-6 | Initial learning rate for the AdamW optimizer |
| num_train_epochs | int | 3 | Number of training epochs (inherited) |
| per_device_train_batch_size | int | 8 | Batch size per GPU/TPU core (inherited) |
| gradient_accumulation_steps | int | 1 | Steps before backward pass (inherited) |
| warmup_ratio | float | 0.1 | Ratio of warmup steps (inherited) |
| weight_decay | float | 0.01 | L2 regularization coefficient (inherited) |
| max_grad_norm | float | 1.0 | Maximum gradient norm for clipping (inherited) |
Note: Many parameters are inherited from transformers.TrainingArguments
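
Because GRPOConfig extends transformers.TrainingArguments, these inherited fields are set directly on the config object. A minimal sketch using the documented defaults (values shown only for illustration):

from trainers.grpo_config import GRPOConfig

# Inherited TrainingArguments fields sit alongside GRPO-specific ones.
config = GRPOConfig(
    model_name_or_path="Arc-Intelligence/ATLAS-8B-Thinking",
    learning_rate=1e-6,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    warmup_ratio=0.1,
    weight_decay=0.01,
    max_grad_norm=1.0,
)
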
GRPO-Specific Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| beta | float | 0.04 | KL coefficient |
| temperature | float | 0.9 | Temperature for sampling completions |
| num_generations | int | 8 | Number of generations to sample |
| max_completion_length | int | 256 | Maximum length of generated completion |
| max_prompt_length | int | 512 | Maximum prompt length (truncated from the left) |
| reward_weights | list[float] | None | Weights for each reward function |
Source: trainers/grpo_config.py
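
A short sketch combining the generation-control fields above; the reward_weights values are illustrative and should contain one entry per reward function used by the trainer:

from trainers.grpo_config import GRPOConfig

config = GRPOConfig(
    model_name_or_path="Arc-Intelligence/ATLAS-8B-Thinking",
    beta=0.04,                  # KL coefficient
    temperature=0.9,
    num_generations=8,          # completions sampled for each prompt
    max_completion_length=256,
    max_prompt_length=512,      # longer prompts are truncated from the left
    reward_weights=[0.7, 0.3],  # illustrative: one weight per reward function
)
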
Sampling Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| top_k | int | None | Top-k sampling parameter |
| top_p | float | 1.0 | Nucleus sampling threshold |
| min_p | float | None | Minimum token probability |
| repetition_penalty | float | 1.0 | Penalty for token repetition |
| generation_aggregation_steps | int | None | Aggregates generations across steps |
| shuffle_generation_inputs | bool | False | Randomly shuffle prompts |
Source: trainers/grpo_config.py
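
Sampling behaviour can be tightened or relaxed through the fields above; the values below are illustrative, not recommendations:

from trainers.grpo_config import GRPOConfig

config = GRPOConfig(
    model_name_or_path="Arc-Intelligence/ATLAS-8B-Thinking",
    top_k=50,                   # restrict sampling to the 50 most likely tokens
    top_p=0.95,                 # nucleus sampling threshold
    repetition_penalty=1.1,     # discourage repeated tokens
    shuffle_generation_inputs=True,
)
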
vLLM Generation Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| use_vllm | bool | False | Use vLLM for generating completions |
| use_vllm_server | bool | False | Use a vLLM server for generation |
| vllm_device | str | "auto" | Device where vLLM generation runs |
| vllm_gpu_memory_utilization | float | 0.9 | GPU memory ratio for vLLM |
| vllm_dtype | str | "auto" | Data type for vLLM generation |
| vllm_max_model_len | int | None | Max model length for vLLM |
| vllm_host | str | None | Host of the vLLM server |
| vllm_port | int | None | Port of the vLLM server |
| num_vllm_clients | int | 1 | Number of vLLM clients |
Source: trainers/grpo_config.py (lines 102-184)
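
For faster generation, vLLM can run in-process or against a standalone server; the host and port below are placeholders, not defaults:

from trainers.grpo_config import GRPOConfig

# In-process vLLM generation
config = GRPOConfig(
    model_name_or_path="Arc-Intelligence/ATLAS-8B-Thinking",
    use_vllm=True,
    vllm_device="auto",
    vllm_gpu_memory_utilization=0.9,
    vllm_dtype="auto",
)

# Or route generation through a vLLM server (placeholder host/port)
server_config = GRPOConfig(
    model_name_or_path="Arc-Intelligence/ATLAS-8B-Thinking",
    use_vllm_server=True,
    vllm_host="127.0.0.1",
    vllm_port=8000,
    num_vllm_clients=1,
)
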
Memory Optimization Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| offload_untrained_models | bool | False | Offload reference/reward models to minimize memory |
| ds3_gather_for_generation | bool | True | Gather policy weights for generation (DeepSpeed ZeRO-3) |
| backprop_accumulation_steps | int | None | Accumulate loss during backprop computation |
| backprop_accumulation_micro_batch_size | int | None | Max per-device batch size during backprop |
| remove_unused_columns | bool | False | If True, keep only the 'prompt' column in the dataset |
Note: Other memory options inherited from TrainingArguments
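
A sketch of the memory-focused options for constrained GPUs; whether they help depends on the hardware and DeepSpeed setup, so treat the combination below as illustrative:

from trainers.grpo_config import GRPOConfig

config = GRPOConfig(
    model_name_or_path="Arc-Intelligence/ATLAS-8B-Thinking",
    offload_untrained_models=True,             # offload reference/reward models
    ds3_gather_for_generation=True,            # gather ZeRO-3 shards for generation
    backprop_accumulation_micro_batch_size=2,  # cap per-device batch during backprop
)
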
Teacher-Specific Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| max_probe_tokens | int | 500 | Maximum tokens for student diagnostic probing |
| student_diagnostic_template | str | None | Template for generating student diagnostic probes |
| teacher_adaptive_template | str | None | Template for generating teacher adaptive teaching |
| student_with_teaching_template | str | None | Template for student solution with teaching |
| student_baseline_template | str | None | Template for student baseline solution |
Source: trainers/grpo_config.py (lines 330-353, teacher-specific parameters)
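
The teacher-specific fields are set on the same GRPOConfig; the template string below is a made-up placeholder to show the idea, not the format shipped in grpo_config.py:

from trainers.grpo_config import GRPOConfig

config = GRPOConfig(
    model_name_or_path="Arc-Intelligence/ATLAS-8B-Thinking",
    max_probe_tokens=500,  # cap on student diagnostic probe length
    # Hypothetical template text; the real defaults live in grpo_config.py.
    student_diagnostic_template="Briefly outline how you would approach this problem.",
)
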
Logging and Checkpointing Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| logging_steps | int | 10 | Log every N steps |
| save_steps | int | 500 | Save a checkpoint every N steps |
| eval_steps | int | 500 | Evaluate every N steps |
| save_total_limit | int | 3 | Maximum checkpoints to keep |
| load_best_model_at_end | bool | True | Load the best model after training |
| metric_for_best_model | str | "eval_reward" | Metric for model selection |
| greater_is_better | bool | True | Whether the metric should increase |
| report_to | list | ["wandb"] | Logging integrations |
Best practices:
  • Set save_steps = eval_steps for consistency
  • Use save_total_limit to manage disk space
  • Enable W&B for experiment tracking
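
A sketch that applies these practices (save_steps kept equal to eval_steps, checkpoint count bounded, W&B enabled); values are illustrative:

from trainers.grpo_config import GRPOConfig

config = GRPOConfig(
    model_name_or_path="Arc-Intelligence/ATLAS-8B-Thinking",
    logging_steps=10,
    save_steps=500,
    eval_steps=500,                 # matches save_steps for consistent checkpoints
    save_total_limit=3,             # keep disk usage bounded
    load_best_model_at_end=True,
    metric_for_best_model="eval_reward",
    greater_is_better=True,
    report_to=["wandb"],
)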

Teacher Training Usage

TeacherGRPOTrainer uses the same GRPOConfig but accepts additional constructor parameters:
from trainers.teacher_trainers import TeacherGRPOTrainer
from trainers.grpo_config import GRPOConfig

config = GRPOConfig(
    model_name_or_path="Arc-Intelligence/ATLAS-8B-Thinking",
    max_probe_tokens=500,  # Teacher-specific parameter
    learning_rate=1e-6
)

trainer = TeacherGRPOTrainer(
    config,
    student_model="meta-llama/Llama-3.2-8B-Instruct",  # Constructor parameter
    # Other standard parameters...
)

Command-Line Overrides

Any parameter can be overridden via command line:
# Override single parameter
scripts/launch.sh 8 configs/run/teacher_sft.yaml learning_rate=1e-5

# Override multiple parameters
scripts/launch.sh 8 configs/run/teacher_sft.yaml \
  learning_rate=1e-5 \
  num_train_epochs=5 \
  per_device_train_batch_size=2

# Override nested parameters
scripts/launch.sh 8 configs/run/teacher_sft.yaml \
  generation_kwargs.temperature=0.9 \
  generation_kwargs.top_p=0.8

Configuration Usage

ATLAS configurations are standard dataclasses extending transformers.TrainingArguments:
from trainers.grpo_config import GRPOConfig
from trainers.grpo import GRPOTrainer

config = GRPOConfig(
    model_name_or_path="Arc-Intelligence/ATLAS-8B-Thinking",
    learning_rate=1e-6,  # Default from actual config
    beta=0.04,  # KL coefficient
    temperature=0.9,  # Default from actual config
    num_generations=8  # Default from actual config
)

trainer = GRPOTrainer(config)
trainer.train()

Source Code

For complete implementation details:
  • GRPOConfig: trainers/grpo_config.py
  • GRPOTrainer: trainers/grpo.py
  • TeacherGRPOTrainer: trainers/teacher_trainers.py

Next Steps