Core Concepts
ATLAS
Adaptive Teaching and Learning Alignment System - A continual learning framework that separates complex RL training into offline teacher preparation and online task adaptation.
Continual Learning
The ability of an agent to improve from experience and transfer learned skills across tasks without retraining the base model weights.
Hybrid Architecture
ATLAS’s approach of separating offline RL training (for teachers) from runtime continual learning (for task adaptation), enabling both stability and flexibility.
Training Algorithms
GRPO
Group Relative Policy Optimization - The offline RL algorithm used to train ATLAS teacher models. Optimizes teaching policies through group-relative rewards with KL divergence constraints.
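As a minimal sketch (not the exact ATLAS implementation), the group-relative part of GRPO can be pictured as normalizing rewards within each sampled group of completions, so each completion is scored against its peers rather than against an absolute baseline:

```python
import numpy as np

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """Illustrative sketch: mean-center and std-scale rewards within one sampled group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four completions sampled for the same prompt
print(group_relative_advantages([0.2, 0.9, 0.4, 0.9]))
```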
SFT
Supervised Fine-Tuning - Initial training phase that establishes baseline capabilities before RL optimization. Required warmup step before GRPO training.
Technical Terms
Two-Pass Protocol
ATLAS’s inference pattern:
- Diagnostic Probe (≤50 tokens): Teacher assesses student capability
- Adaptive Guidance (≤200 tokens): Teacher provides calibrated assistance
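A hedged sketch of the protocol in code; the `teacher` and `student` objects and their `generate` methods are illustrative placeholders, not the actual ATLAS API:

```python
def two_pass(teacher, student, task: str) -> str:
    # Pass 1: diagnostic probe, budgeted at roughly 50 tokens
    probe = teacher.generate(
        f"Assess the student's capability on this task: {task}", max_tokens=50
    )
    # Pass 2: adaptive guidance calibrated to the probe, budgeted at roughly 200 tokens
    guidance = teacher.generate(
        f"Given this assessment:\n{probe}\nProvide calibrated guidance.", max_tokens=200
    )
    # The student answers with guidance prepended; its weights are never modified.
    return student.generate(f"{guidance}\n\nTask: {task}")
```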
Teacher Model
Specialized 8B-parameter models trained with GRPO to diagnose and guide other language models. Pre-trained versions are available on HuggingFace.
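Loading a pre-trained teacher with `transformers` might look like the following sketch; the repository ID is a placeholder, not an actual checkpoint name:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo ID; substitute the actual ATLAS teacher checkpoint from HuggingFace.
repo = "your-org/atlas-teacher-8b"
tokenizer = AutoTokenizer.from_pretrained(repo)
teacher = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")
```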
Student Model
Any language model, agent, or AI system that receives guidance from the ATLAS teacher. This includes:
- Commercial LLMs (GPT, Claude, Gemini)
- Open models (Llama, Mistral, Qwen)
- Your custom agents (OpenAI Assistants, LangChain, AutoGen)
- API endpoints or services
- CLI-based tools or scripts
The student model remains unchanged; ATLAS enhances its responses through external guidance rather than modifying its weights.
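Because guidance is injected purely at the prompt level, a student can be any prompt-to-text callable. A minimal illustrative sketch (not the ATLAS API):

```python
from typing import Callable

# The "student" is anything that maps a prompt to text: an API client,
# a local model, or a CLI shim. It is never fine-tuned.
def guided_response(student: Callable[[str], str], guidance: str, task: str) -> str:
    return student(f"{guidance}\n\nTask: {task}")
```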
Non-Degradation Rate
Percentage of interactions where performance remains equal to or better than baseline (target: ≥97%).
Compounding Intelligence
The accumulation and transfer of learned skills across tasks and domains through the hybrid architecture.
RIM (Reward Interpretation Model)
ATLAS’s reward ensemble, which evaluates every step and final answer using multiple judging LLMs. The runtime configuration (RIMConfig) routes each interaction through small and large judges, aggregates their scores, and decides whether to retry, certify, or persist guidance.
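Purely as an illustration of the idea (the field names below are assumptions, not the real RIMConfig schema), such a configuration might look like:

```python
from dataclasses import dataclass, field

# Illustrative only: field names are assumptions, not the actual RIMConfig schema.
@dataclass
class RIMConfigSketch:
    small_judges: list[str] = field(default_factory=lambda: ["judge-small"])
    large_judges: list[str] = field(default_factory=lambda: ["judge-large"])
    escalation_threshold: float = 0.6   # below this aggregate score, retry or escalate
    certify_threshold: float = 0.9      # above this, certify and persist guidance
```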
Persona
Runtime persona prompt bundles (planner, executor, synthesizer, verifier) that shape student and teacher behavior. Personas can be updated via memory, tagged for reuse, and inspected in exported traces to understand how guidance evolved.
Triage Dossier
Structured context produced by the triage adapter before execution begins. The dossier summarizes task metadata, risks, history, and persona hints; it informs the capability probe and lane selection and is exported with every runtime trace.
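An illustrative shape for a dossier (field names are assumptions, not the actual schema):

```python
from dataclasses import dataclass, field

# Illustrative only: not the real triage dossier schema.
@dataclass
class TriageDossierSketch:
    task_metadata: dict = field(default_factory=dict)
    risks: list[str] = field(default_factory=list)
    history_summary: str = ""
    persona_hints: list[str] = field(default_factory=list)  # feeds the capability probe and lane selection
```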
Metrics
TES (Teaching Efficiency Score)
(accuracy_gain * completion_rate) / (teaching_tokens / 1000)
Measures the efficiency of teaching relative to token usage.
NDR (Non-Degradation Rate)
Percentage of cases where ATLAS-enhanced response equals or exceeds baseline performance.
Learning Rate (LR)
In the ATLAS context: Δ_performance / num_iterations
Measures how quickly the system adapts to new tasks.
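The metric formulas above translate directly into code; a minimal sketch with made-up example values:

```python
def tes(accuracy_gain: float, completion_rate: float, teaching_tokens: int) -> float:
    """Teaching Efficiency Score: (accuracy_gain * completion_rate) / (teaching_tokens / 1000)."""
    return (accuracy_gain * completion_rate) / (teaching_tokens / 1000)

def learning_rate(delta_performance: float, num_iterations: int) -> float:
    """ATLAS learning rate: performance improvement per adaptation iteration."""
    return delta_performance / num_iterations

print(tes(accuracy_gain=0.15, completion_rate=0.9, teaching_tokens=180))  # 0.75
print(learning_rate(delta_performance=0.12, num_iterations=4))            # 0.03
```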
Infrastructure
vLLM
High-throughput inference server used during GRPO training for efficient generation. Handles distributed inference across GPUs.
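A minimal vLLM generation call might look like the following sketch; the model name and prompt are placeholders:

```python
from vllm import LLM, SamplingParams

# Placeholder model name; during GRPO training vLLM serves the policy for fast rollouts.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)
outputs = llm.generate(
    ["Assess the student's capability on this task..."],
    SamplingParams(temperature=0.7, max_tokens=200),
)
print(outputs[0].outputs[0].text)
```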
Flash Attention
Memory-efficient attention mechanism that speeds up training and reduces GPU memory usage. Recommended for all deployments.
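With Hugging Face `transformers`, Flash Attention 2 is typically enabled at load time; the repository ID below is a placeholder and the `flash-attn` package must be installed:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "your-org/atlas-teacher-8b",              # placeholder repo ID
    torch_dtype=torch.bfloat16,               # FlashAttention requires fp16/bf16
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
```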
KL Divergence
Kullback-Leibler divergence - Constraint used in GRPO to prevent policy collapse by keeping the trained model close to the reference model.
Optimization Terms
Beta (β)
KL divergence coefficient in GRPO (default: 0.04). Controls how much the policy can deviate from the reference model.
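A simplified sketch of how beta scales the KL penalty in a GRPO-style objective (not the exact loss used in training):

```python
import torch

def kl_penalized_objective(advantages: torch.Tensor,
                           policy_logprobs: torch.Tensor,
                           ref_logprobs: torch.Tensor,
                           beta: float = 0.04) -> torch.Tensor:
    # Simplified sketch: advantage-weighted log-likelihood minus a KL penalty that
    # keeps the trained policy close to the frozen reference model.
    kl_estimate = policy_logprobs - ref_logprobs
    return (advantages * policy_logprobs - beta * kl_estimate).mean()
```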
Temperature
Sampling parameter controlling randomness in generation (default: 0.7). Higher values increase diversity.
Gradient Accumulation
Technique to simulate larger batch sizes by accumulating gradients over multiple forward passes before updating weights.
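A schematic PyTorch loop illustrating the pattern with a toy model; in practice the model, optimizer, and data come from the training pipeline:

```python
import torch
from torch import nn

# Toy setup purely to illustrate the accumulation pattern.
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
data = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(32)]

accumulation_steps = 8  # effective batch = micro-batch size * accumulation_steps
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y) / accumulation_steps  # scale so gradients average
    loss.backward()                                                   # gradients accumulate across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one weight update per accumulated "large" batch
        optimizer.zero_grad()
```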
See Also