Core Concepts

ATLAS

Adaptive Teaching and Learning Alignment System - A continual learning framework that separates complex RL training into offline teacher preparation and online task adaptation.

Continual Learning

The ability of an agent to improve from experience and transfer learned skills across tasks without retraining the base model weights.

Hybrid Architecture

ATLAS’s approach of separating offline RL training (for teachers) from runtime continual learning (for task adaptation), enabling both stability and flexibility.

Training Algorithms

GRPO

Group Relative Policy Optimization - The offline RL algorithm used to train ATLAS teacher models. Optimizes teaching policies through group-relative rewards with KL divergence constraints.
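
A minimal sketch of the two ingredients the name refers to, written against generic tensors rather than the actual ATLAS trainer: advantages are computed relative to each sampling group, and a KL penalty (coefficient β, see Beta below) keeps the policy near the reference model.

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor) -> torch.Tensor:
    """Normalize each completion's reward against its sampling group (shape: [groups, samples])."""
    mean = group_rewards.mean(dim=-1, keepdim=True)
    std = group_rewards.std(dim=-1, keepdim=True)
    return (group_rewards - mean) / (std + 1e-8)

def grpo_loss(logp_new, logp_ref, advantages, beta: float = 0.04):
    """Policy-gradient term plus a KL penalty toward the reference policy."""
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0  # common per-token KL estimator
    return -(advantages * logp_new).mean() + beta * kl.mean()
```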

SFT

Supervised Fine-Tuning - Initial training phase that establishes baseline capabilities before RL optimization. Required warmup step before GRPO training.

Technical Terms

Two-Pass Protocol

ATLAS’s inference pattern:
  1. Diagnostic Probe (≤50 tokens): Teacher assesses student capability
  2. Adaptive Guidance (≤200 tokens): Teacher provides calibrated assistance
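
A hedged sketch of how the two passes compose; `teacher` and `student` stand in for generic `generate(prompt, max_tokens)` callables, not the actual ATLAS API:

```python
def two_pass(teacher, student, task: str) -> str:
    # Pass 1: diagnostic probe (<=50 tokens) to gauge the student's capability
    probe = teacher.generate(f"Write a short diagnostic question for: {task}", max_tokens=50)
    attempt = student.generate(probe, max_tokens=100)

    # Pass 2: adaptive guidance (<=200 tokens) calibrated to the observed attempt
    guidance = teacher.generate(
        f"Task: {task}\nStudent's attempt: {attempt}\nProvide calibrated guidance.",
        max_tokens=200,
    )

    # Final answer: guidance is prepended to the prompt; the student's weights never change
    return student.generate(f"{guidance}\n\n{task}")
```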

Teacher Model

Specialized 8B-parameter models trained with GRPO to diagnose and guide other language models. Pre-trained versions are available on HuggingFace.

Student Model

Any language model, agent, or AI system that receives guidance from the ATLAS teacher. This includes:
  • Commercial LLMs (GPT, Claude, Gemini)
  • Open models (Llama, Mistral, Qwen)
  • Your custom agents (OpenAI Assistants, LangChain, AutoGen)
  • API endpoints or services
  • CLI-based tools or scripts
The student model remains unchanged - ATLAS enhances its responses through external guidance, not modification.
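
Because the interface is just text in, text out, anything that can be called with a prompt can act as a student. An illustrative adapter (the names are hypothetical, not part of the ATLAS SDK):

```python
from typing import Callable, Optional

class StudentAdapter:
    """Wraps any prompt -> text callable (API client, agent framework, CLI wrapper)."""

    def __init__(self, generate_fn: Callable[[str], str]):
        self._generate = generate_fn

    def generate(self, prompt: str, max_tokens: Optional[int] = None) -> str:
        # Guidance arrives inside the prompt; the underlying model is never fine-tuned.
        return self._generate(prompt)

# Example: wrapping a plain function; an HTTP call or subprocess would work the same way.
echo_student = StudentAdapter(lambda p: f"(student response to) {p[:60]}")
```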

Non-Degradation Rate

Percentage of interactions where performance remains equal to or better than baseline (target: ≥97%).

Compounding Intelligence

The accumulation and transfer of learned skills across tasks and domains through the hybrid architecture.

RIM (Reward Interpretation Model)

ATLAS’s reward ensemble that evaluates every step and final answer using multiple judging LLMs. The runtime (RIMConfig) routes each interaction through small and large judges, aggregates their scores, and decides whether to retry, certify, or persist guidance.
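
The decision logic can be pictured as score aggregation plus thresholds. The sketch below is illustrative only; the thresholds and function names are assumptions, not the real RIMConfig API:

```python
from statistics import mean
from typing import List

def rim_decide(judge_scores: List[float], certify_at: float = 0.9, retry_below: float = 0.5) -> str:
    """Aggregate ensemble judge scores and map them to a runtime action."""
    score = mean(judge_scores)      # simple average; the real ensemble may weight judges differently
    if score >= certify_at:
        return "certify"            # accept the answer and persist the guidance
    if score < retry_below:
        return "retry"              # escalate, e.g. re-run with a larger judge or new guidance
    return "persist"                # keep the guidance for reuse without certifying

print(rim_decide([0.92, 0.95, 0.9]))  # -> "certify"
```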

Persona

Runtime persona prompt bundles (planner, executor, synthesizer, verifier) that shape student and teacher behaviour. Personas can be updated via memory, tagged for reuse, and inspected in exported traces to understand how guidance evolved.

Triage Dossier

Structured context produced by the triage adapter before execution begins. The dossier summarises task metadata, risks, history, and persona hints; it informs the capability probe and lane selection and is exported with every runtime trace.
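
A possible shape for such a dossier, with field names chosen to mirror the description above rather than the real schema:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TriageDossier:
    task_metadata: Dict[str, str]                            # task type, domain, expected output format
    risks: List[str] = field(default_factory=list)           # known failure modes to watch for
    history: List[str] = field(default_factory=list)         # summaries of prior related interactions
    persona_hints: List[str] = field(default_factory=list)   # personas likely to help (planner, verifier, ...)
```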

Metrics

TES (Teaching Efficiency Score)

TES = (accuracy_gain * completion_rate) / (teaching_tokens / 1000)

Measures the efficiency of teaching relative to token usage.
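
A worked example of the formula, with made-up numbers:

```python
# Hypothetical values illustrating the TES formula
accuracy_gain = 0.15      # +15 points over the unassisted baseline
completion_rate = 0.92    # fraction of tasks completed
teaching_tokens = 180     # probe + guidance tokens spent

tes = (accuracy_gain * completion_rate) / (teaching_tokens / 1000)
print(round(tes, 3))      # 0.767
```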

NDR (Non-Degradation Rate)

Percentage of cases where ATLAS-enhanced response equals or exceeds baseline performance.

Learning Rate (LR)

In the ATLAS context:

LR = Δ_performance / num_iterations

Measures how quickly the system adapts to new tasks.
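
A quick calculation of NDR and LR with made-up numbers:

```python
# Per-interaction ratio of enhanced score to baseline score (hypothetical)
outcomes = [1.0, 1.0, 0.98, 1.02, 1.0]
ndr = sum(o >= 1.0 for o in outcomes) / len(outcomes)   # fraction not degraded

delta_performance = 0.12   # total improvement on the new task (hypothetical)
num_iterations = 4         # adaptation iterations spent
lr = delta_performance / num_iterations

print(f"NDR={ndr:.0%}, LR={lr:.2f}")   # NDR=80%, LR=0.03
```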

Infrastructure

vLLM

High-throughput inference server used during GRPO training for efficient generation. Handles distributed inference across GPUs.

Flash Attention

Memory-efficient attention mechanism that speeds up training and reduces GPU memory usage. Recommended for all deployments.

KL Divergence

Kullback-Leibler divergence - Constraint used in GRPO to prevent policy collapse by keeping the trained model close to the reference model.
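
In the GRPO objective it appears as a penalty weighted by β (see Beta below). Written out in standard notation, with π_θ the trained policy and π_ref the reference model (standard definition, not ATLAS-specific notation):

$$D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}) = \mathbb{E}_{y \sim \pi_\theta}\left[\log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\right]$$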

Optimization Terms

Beta (β)

KL divergence coefficient in GRPO (default: 0.04). Controls how much the policy can deviate from the reference model.

Temperature

Sampling parameter controlling randomness in generation (default: 0.7). Higher values increase diversity.

Gradient Accumulation

Technique to simulate larger batch sizes by accumulating gradients over multiple forward passes before updating weights.
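
An illustrative configuration tying the three terms above together; the field names follow common TRL-style conventions and are assumptions, not the exact ATLAS config schema:

```python
grpo_config = {
    "beta": 0.04,                       # KL coefficient: how far the policy may drift from the reference
    "temperature": 0.7,                 # sampling temperature for generated completions
    "per_device_train_batch_size": 2,   # small per-step batch that fits in GPU memory
    "gradient_accumulation_steps": 8,   # effective batch size = 2 * 8 = 16 without extra memory
}
```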

See Also