---
title: Training Configs API
description: Python reference for loading and overriding Atlas training configurations.
sidebarTitle: Training Configs
icon: sliders
---
## Overview
ATLAS configurations control every aspect of training and inference. Parameters are organized into logical groups for easier navigation.
## Typical Usage
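A minimal sketch of building a config and handing it to the trainer. The import paths follow the file layout listed under Source Code at the bottom of this page, the model name is a placeholder, and the trainer hand-off is shown only in outline because its exact constructor signature lives in `trainers/grpo.py`.

```python
from trainers.grpo_config import GRPOConfig  # path per the Source Code section below
from trainers.grpo import GRPOTrainer        # actual constructor signature: trainers/grpo.py

# Core training parameters; anything not set here keeps the defaults in the tables below.
config = GRPOConfig(
    output_dir="outputs/grpo-run",                   # inherited from transformers.TrainingArguments
    model_name_or_path="Qwen/Qwen2.5-7B-Instruct",   # placeholder model
    learning_rate=1e-6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    num_train_epochs=3,
)

# Hand the config to the trainer (argument names here are illustrative; check trainers/grpo.py):
# trainer = GRPOTrainer(args=config, train_dataset=my_dataset)
# trainer.train()
```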
## GRPOConfig Parameters
### Core Training Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_name_or_path` | str | required | HuggingFace model ID or local path |
| `learning_rate` | float | 1e-6 | Initial learning rate for the AdamW optimizer |
| `num_train_epochs` | int | 3 | Number of training epochs (inherited) |
| `per_device_train_batch_size` | int | 8 | Batch size per GPU/TPU core (inherited) |
| `gradient_accumulation_steps` | int | 1 | Steps to accumulate gradients before an optimizer update (inherited) |
| `warmup_ratio` | float | 0.1 | Ratio of warmup steps (inherited) |
| `weight_decay` | float | 0.01 | L2 regularization coefficient (inherited) |
| `max_grad_norm` | float | 1.0 | Maximum gradient norm for clipping (inherited) |
Parameters marked *(inherited)* come from `transformers.TrainingArguments`.
### GRPO Algorithm Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `beta` | float | 0.04 | KL coefficient |
| `temperature` | float | 0.9 | Temperature for sampling completions |
| `num_generations` | int | 8 | Number of generations to sample |
| `max_completion_length` | int | 256 | Maximum length of each generated completion |
| `max_prompt_length` | int | 512 | Maximum prompt length (truncated from the left) |
| `reward_weights` | list[float] | None | Weights for each reward function |
Source: `trainers/grpo_config.py`
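For example, the GRPO-specific fields are set on the same config object. The values below are illustrative rather than recommended settings, and the model name is a placeholder.

```python
from trainers.grpo_config import GRPOConfig  # assumed path, see Source Code

config = GRPOConfig(
    output_dir="outputs/grpo-run",
    model_name_or_path="Qwen/Qwen2.5-7B-Instruct",  # placeholder
    beta=0.04,                  # KL coefficient
    temperature=0.9,            # sampling temperature for completions
    num_generations=8,          # number of generations to sample
    max_prompt_length=512,
    max_completion_length=256,
    reward_weights=[1.0, 0.5],  # one weight per configured reward function
)
```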
### Generation & Sampling
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `top_k` | int | None | Top-k sampling parameter |
| `top_p` | float | 1.0 | Nucleus sampling threshold |
| `min_p` | float | None | Minimum token probability |
| `repetition_penalty` | float | 1.0 | Penalty for token repetition |
| `generation_aggregation_steps` | int | None | Aggregates generations across steps |
| `shuffle_generation_inputs` | bool | False | Randomly shuffle prompts |
Source: `trainers/grpo_config.py`
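A short illustrative sketch of the sampling controls, with a placeholder model name; pick the values that suit your task.

```python
from trainers.grpo_config import GRPOConfig  # assumed path, see Source Code

config = GRPOConfig(
    output_dir="outputs/grpo-run",
    model_name_or_path="Qwen/Qwen2.5-7B-Instruct",  # placeholder
    temperature=0.7,
    top_p=0.95,              # nucleus sampling threshold
    repetition_penalty=1.1,  # values above 1.0 discourage repeated tokens
)
```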
### vLLM Integration
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `use_vllm` | bool | False | Use vLLM for generating completions |
| `use_vllm_server` | bool | False | Use a vLLM server for generation |
| `vllm_device` | str | "auto" | Device where vLLM generation runs |
| `vllm_gpu_memory_utilization` | float | 0.9 | GPU memory ratio for vLLM |
| `vllm_dtype` | str | "auto" | Data type for vLLM generation |
| `vllm_max_model_len` | int | None | Max model length for vLLM |
| `vllm_host` | str | None | Host of the vLLM server |
| `vllm_port` | int | None | Port of the vLLM server |
| `num_vllm_clients` | int | 1 | Number of vLLM clients |
Source: `trainers/grpo_config.py`, lines 102-184
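Two common shapes of this setup are sketched below: an in-process vLLM engine, or generation against a standalone vLLM server. The host, port, and model name are placeholders; which fields your launch scripts actually require depends on your deployment.

```python
from trainers.grpo_config import GRPOConfig  # assumed path, see Source Code

# In-process vLLM engine on the training node:
local_config = GRPOConfig(
    output_dir="outputs/grpo-vllm",
    model_name_or_path="Qwen/Qwen2.5-7B-Instruct",  # placeholder
    use_vllm=True,
    vllm_device="auto",
    vllm_gpu_memory_utilization=0.9,
)

# Generation against a separate vLLM server (host/port are placeholders):
server_config = GRPOConfig(
    output_dir="outputs/grpo-vllm-server",
    model_name_or_path="Qwen/Qwen2.5-7B-Instruct",
    use_vllm_server=True,
    vllm_host="127.0.0.1",
    vllm_port=8000,
    num_vllm_clients=1,
)
```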
### Memory & Training Optimization
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `offload_untrained_models` | bool | False | Offload reference/reward models to minimize memory |
| `ds3_gather_for_generation` | bool | True | Gather policy weights for generation (DeepSpeed ZeRO-3) |
| `backprop_accumulation_steps` | int | None | Accumulate loss during backprop computations |
| `backprop_accumulation_micro_batch_size` | int | None | Max per-device batch size during backprop |
| `remove_unused_columns` | bool | False | Whether to keep only the `prompt` column in the dataset |
See also: `transformers.TrainingArguments`.
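A sketch of a memory-constrained configuration using the fields above; the values are illustrative and the model name is a placeholder.

```python
from trainers.grpo_config import GRPOConfig  # assumed path, see Source Code

config = GRPOConfig(
    output_dir="outputs/grpo-lowmem",
    model_name_or_path="Qwen/Qwen2.5-7B-Instruct",   # placeholder
    offload_untrained_models=True,             # keep reference/reward models off-GPU when idle
    ds3_gather_for_generation=True,            # gather ZeRO-3 shards before generation
    backprop_accumulation_micro_batch_size=2,  # cap the per-device batch during backprop
)
```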
### Teacher Training Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `max_probe_tokens` | int | 500 | Maximum tokens for student diagnostic probing |
| `student_diagnostic_template` | str | None | Template for generating student diagnostic probes |
| `teacher_adaptive_template` | str | None | Template for generating teacher adaptive teaching |
| `student_with_teaching_template` | str | None | Template for the student solution with teaching |
| `student_baseline_template` | str | None | Template for the student baseline solution |
Source: `trainers/grpo_config.py`, lines 330-353 (teacher-specific parameters)
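The template fields are plain strings; the snippet below only illustrates where they plug in. Both the template text and its `{...}` placeholder names are made up here, so use the templates that ship with the repo's configs.

```python
from trainers.grpo_config import GRPOConfig  # assumed path, see Source Code

config = GRPOConfig(
    output_dir="outputs/teacher-grpo",
    model_name_or_path="Qwen/Qwen2.5-7B-Instruct",  # placeholder
    max_probe_tokens=500,
    # Hypothetical template strings and placeholder names, for illustration only:
    student_diagnostic_template="Problem: {problem}\nBriefly outline your approach.",
    teacher_adaptive_template="Student approach: {approach}\nGive targeted guidance.",
)
```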
### Logging & Checkpointing
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `logging_steps` | int | 10 | Log every N steps |
| `save_steps` | int | 500 | Save a checkpoint every N steps |
| `eval_steps` | int | 500 | Evaluate every N steps |
| `save_total_limit` | int | 3 | Maximum number of checkpoints to keep |
| `load_best_model_at_end` | bool | True | Load the best model after training |
| `metric_for_best_model` | str | "eval_reward" | Metric for model selection |
| `greater_is_better` | bool | True | Whether the metric should increase |
| `report_to` | list | ["wandb"] | Logging integrations |
- Set `save_steps` equal to `eval_steps` for consistency
- Use `save_total_limit` to manage disk space
- Enable W&B for experiment tracking
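Putting the table and the tips above together, a checkpointing setup might look like the following (illustrative values, placeholder model name):

```python
from trainers.grpo_config import GRPOConfig  # assumed path, see Source Code

config = GRPOConfig(
    output_dir="outputs/grpo-run",
    model_name_or_path="Qwen/Qwen2.5-7B-Instruct",  # placeholder
    logging_steps=10,
    save_steps=500,
    eval_steps=500,               # aligned with save_steps, per the tip above
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="eval_reward",
    report_to=["wandb"],
)
```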
## Reward System Parameters
The reward ensemble is configured through Hydra rather than through direct constructor arguments. Before editing the YAML, decide which judges you need and how aggressively you want to escalate to the large arbiter. Once that intent is clear, update the `rim` block in the referenced config file.
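If you prefer applying such changes from code rather than editing the YAML by hand, Hydra's compose API can layer overrides on top of the config file. This is only a sketch: the config directory, config name, and the keys under `rim` are assumptions, not the repo's actual schema.

```python
from hydra import compose, initialize

# Compose the training config with an override on the reward (rim) block.
# "configs", "train", and rim.escalation_threshold are hypothetical names;
# check the repo's Hydra config files for the real ones.
with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="train",
        overrides=["rim.escalation_threshold=0.8"],
    )

print(cfg.rim)
```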
## Teacher Training Usage
`TeacherGRPOTrainer` uses the same `GRPOConfig` but accepts additional constructor parameters:
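A sketch of the hand-off. The keyword names for the extra constructor parameters are guesses based on the teacher-specific fields above, so check `trainers/teacher_trainers.py` for the real signature before copying this.

```python
from trainers.grpo_config import GRPOConfig               # assumed paths, see Source Code
from trainers.teacher_trainers import TeacherGRPOTrainer

config = GRPOConfig(
    output_dir="outputs/teacher-grpo",
    model_name_or_path="Qwen/Qwen2.5-7B-Instruct",  # placeholder teacher model
    max_probe_tokens=500,
)

# Constructor keywords below are illustrative; the real signature is in trainers/teacher_trainers.py:
# trainer = TeacherGRPOTrainer(
#     args=config,
#     train_dataset=my_dataset,
#     student_model="Qwen/Qwen2.5-1.5B-Instruct",  # hypothetical argument
# )
# trainer.train()
```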
## Command-Line Overrides

Any parameter can be overridden via the command line:
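Because `GRPOConfig` extends `transformers.TrainingArguments`, every field maps to a `--flag value` pair in `HfArgumentParser`-style entry points. The script name and flag values below are illustrative; whether a given launch script uses this parser or Hydra-style `key=value` overrides depends on the script, but the field-to-flag mapping is the same idea.

```python
from transformers import HfArgumentParser
from trainers.grpo_config import GRPOConfig  # assumed path, see Source Code

parser = HfArgumentParser(GRPOConfig)

# Equivalent of: python train.py --learning_rate 2e-6 --num_generations 16 ...
(config,) = parser.parse_args_into_dataclasses(args=[
    "--output_dir", "outputs/grpo-run",
    "--model_name_or_path", "Qwen/Qwen2.5-7B-Instruct",  # placeholder
    "--learning_rate", "2e-6",
    "--num_generations", "16",
])

print(config.learning_rate, config.num_generations)
```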
## Configuration Usage

ATLAS configurations are standard dataclasses extending `transformers.TrainingArguments`:
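A small sketch, assuming the import path listed under Source Code: because the config is a standard dataclass, everything `TrainingArguments` offers (field inspection, JSON serialization, and so on) still applies.

```python
from dataclasses import fields

from trainers.grpo_config import GRPOConfig  # assumed path, see Source Code

config = GRPOConfig(
    output_dir="outputs/grpo-run",
    model_name_or_path="Qwen/Qwen2.5-7B-Instruct",  # placeholder
)

# Standard TrainingArguments helpers keep working on the subclass:
print(config.to_json_string())                                          # full config as JSON
print([f.name for f in fields(config) if f.name.startswith("vllm_")])   # list vLLM-related fields
```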
## Source Code
For complete implementation details:

- GRPOConfig: `trainers/grpo_config.py`
- GRPOTrainer: `trainers/grpo.py`
- TeacherGRPOTrainer: `trainers/teacher_trainers.py`