Core Concepts
ATLAS
Adaptive Teaching and Learning Alignment System - A continual learning framework that separates complex RL training into offline teacher preparation and online task adaptation.
Continual Learning
The ability of an agent to improve from experience and transfer learned skills across tasks without retraining the base model weights.
Hybrid Architecture
ATLAS’s approach of separating offline RL training (for teachers) from online optimization (for task adaptation), enabling both stability and flexibility.
Training Algorithms
GRPO
Group Relative Policy Optimization - The offline RL algorithm used to train ATLAS teacher models. Optimizes teaching policies through group-relative rewards with KL divergence constraints.
GEPA
Genetic Evolution for Prompt Adaptation - The online optimization algorithm that rapidly adapts to specific tasks through evolutionary search without model retraining.
SFT
Supervised Fine-Tuning - Initial training phase that establishes baseline capabilities before RL optimization. Required warmup step before GRPO training.
Technical Terms
Two-Pass Protocol
ATLAS’s inference pattern:
- Diagnostic Probe (≤50 tokens): Teacher assesses student capability
- Adaptive Guidance (≤200 tokens): Teacher provides calibrated assistance
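The two-pass pattern above can be sketched in a few lines. This is an illustrative sketch only: the `teacher` and `student` callables, the `truncate` helper, and the prompt formats are assumptions for demonstration, not the actual ATLAS API.

```python
# Minimal sketch of the two-pass protocol. Token budgets follow the
# limits stated above; everything else here is a hypothetical interface.
PROBE_BUDGET = 50      # max tokens for the diagnostic probe
GUIDANCE_BUDGET = 200  # max tokens for adaptive guidance


def truncate(text: str, max_tokens: int) -> str:
    """Crudely cap a string at a whitespace-token budget."""
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


def two_pass(teacher, student, task: str) -> str:
    # Pass 1: teacher probes the student's capability on the task.
    probe = truncate(teacher(f"Probe: {task}"), PROBE_BUDGET)
    attempt = student(probe)
    # Pass 2: teacher issues guidance calibrated to the observed attempt.
    guidance = truncate(teacher(f"Guide: {task}\nAttempt: {attempt}"),
                        GUIDANCE_BUDGET)
    # Student produces its final answer with the guidance prepended.
    return student(f"{guidance}\n{task}")
```

The hard caps matter: keeping the probe short bounds the per-interaction overhead the teacher adds before any guidance is issued.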
Teacher Model
Specialized 8B-parameter models trained with GRPO to diagnose and guide other language models. Pre-trained versions available on HuggingFace.
Student Model
Any language model (GPT, Claude, Llama, etc.) that receives guidance from the teacher. Does not require modification or training.
Non-Degradation Rate
Percentage of interactions where performance remains equal to or better than baseline (target: ≥97%).
Compounding Intelligence
The accumulation and transfer of learned skills across tasks and domains through the hybrid architecture.
Metrics
TES (Teaching Efficiency Score)
(accuracy_gain * completion_rate) / (teaching_tokens / 1000)
Measures the efficiency of teaching relative to token usage.
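The TES formula translates directly into code. The helper below is a hypothetical illustration of the formula above, with argument names matching the formula's terms rather than any published ATLAS API:

```python
def teaching_efficiency_score(accuracy_gain: float,
                              completion_rate: float,
                              teaching_tokens: int) -> float:
    """TES = (accuracy_gain * completion_rate) / (teaching_tokens / 1000).

    Normalizing by thousands of tokens rewards teachers that achieve
    gains with short, targeted guidance.
    """
    return (accuracy_gain * completion_rate) / (teaching_tokens / 1000)
```

For example, a 15-point accuracy gain with a 90% completion rate at a cost of 250 teaching tokens yields a TES of about 0.54, while the same gain at 1000 tokens would score only about 0.135.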
NDR (Non-Degradation Rate)
Percentage of cases where the ATLAS-enhanced response equals or exceeds baseline performance.
Learning Rate (LR)
In the ATLAS context:
Δ_performance / num_iterations
Measures how quickly the system adapts to new tasks.
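Both NDR and LR reduce to simple arithmetic over evaluation results. The helpers below are illustrative sketches of the definitions above, not part of any ATLAS library:

```python
def non_degradation_rate(baseline: list[float], enhanced: list[float]) -> float:
    """Fraction of cases where the enhanced score matches or beats baseline."""
    kept = sum(e >= b for b, e in zip(baseline, enhanced))
    return kept / len(baseline)


def learning_rate(perf_start: float, perf_end: float, num_iterations: int) -> float:
    """LR = Δ_performance / num_iterations."""
    return (perf_end - perf_start) / num_iterations


# 3 of 4 cases hold or improve on baseline, so NDR = 0.75
ndr = non_degradation_rate([0.6, 0.7, 0.8, 0.5], [0.7, 0.7, 0.75, 0.9])
```

An NDR of 0.75 in this toy example would fall well short of the ≥97% target, flagging the one degraded case for investigation.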
Infrastructure
vLLM
High-throughput inference server used during GRPO training for efficient generation. Handles distributed inference across GPUs.
Flash Attention
Memory-efficient attention mechanism that speeds up training and reduces GPU memory usage. Recommended for all deployments.
KL Divergence
Kullback-Leibler divergence - Constraint used in GRPO to prevent policy collapse by keeping the trained model close to the reference model.
Optimization Terms
Beta (β)
KL divergence coefficient in GRPO (default: 0.04). Controls how much the policy can deviate from the reference model.
Temperature
Sampling parameter controlling randomness in generation (default: 0.7). Higher values increase diversity.
Gradient Accumulation
Technique to simulate larger batch sizes by accumulating gradients over multiple forward passes before updating weights.
See Also
- Technical Report - Detailed methodology
- Core Concepts - In-depth explanations
- Training Guide - Practical implementation