Overview

GEPA (Genetic-Pareto) is an external optimization package that ATLAS uses to evolve teaching prompts through LLM-based reflection and Pareto-efficient evolutionary search. Published research shows GEPA achieves substantial performance gains with remarkable sample efficiency.
  • Package: gepa v0.0.12
  • Source: https://github.com/gepa-ai/gepa
  • Paper: GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning (Agrawal et al., 2025)
  • License: MIT

Why GEPA?

Traditional approaches to optimizing AI systems face trade-offs between speed, cost, and infrastructure requirements. GEPA offers a middle ground:
| Approach | Rollouts Required | Infrastructure | Key Benefit |
|---|---|---|---|
| Manual prompt engineering | N/A | None | Zero cost |
| GEPA | 678-6,858 | None (API only) | Fast + sample-efficient |
| GRPO (RL training) | 24,000 | 4-8 GPUs + LoRA | Maximum performance |
Official benchmark results (Agrawal et al., 2025, Table 1), on Qwen3 8B across four tasks:
  • HotpotQA: 62.33% (GEPA) vs 43.33% (GRPO) = +19.0% improvement
  • IFBench: 55.66% (GEPA) vs 52.93% (GRPO) = +2.73% improvement
  • HoVer: 51.66% (GEPA) vs 38.00% (GRPO) = +13.66% improvement
  • PUPA: 94.69% (GEPA) vs 89.50% (GRPO) = +5.19% improvement
Average improvement: +10% over GRPO while using up to 35× fewer rollouts. GEPA also outperformed MIPROv2 (a state-of-the-art prompt optimizer) by +10.3% aggregate across all benchmarks.

How It Works

GEPA combines three core mechanisms from the research paper:

1. Reflective Mutation

Instead of random changes, GEPA uses an LLM to analyze failures and propose targeted improvements:
# Conceptual flow from the paper (illustrative, not the gepa package's API)
def reflective_mutation(current_prompt, failed_examples, reflection_llm):
    """Use LLM reflection to intelligently improve prompts.

    `reflection_llm` is any callable that takes a prompt string and
    returns the model's text response.
    """

    # Step 1: Analyze what went wrong
    reflection = reflection_llm(f"""
    Current prompt: {current_prompt}

    Failed on these examples: {failed_examples}

    Analyze: What specific issues caused these failures?
    What patterns do you see? What should change?
    """)

    # Step 2: Generate improved variant
    new_prompt = reflection_llm(f"""
    Original prompt: {current_prompt}
    Analysis: {reflection}

    Generate an improved prompt addressing these issues.
    """)

    return new_prompt
From the paper: “GEPA iteratively mutates every prompt within the AI system in light of natural language feedback drawn from new rollouts. In each mutation, the candidate prompt is derived from an ancestor, accumulating high-level lessons derived from observations and LLM feedback.”
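To show how the sketch above might be driven, here is a toy invocation. The `call_llm` helper, the example data, and the seed prompt usage are placeholders for illustration only; they are not part of the gepa package's API.
# Toy driver for the reflective_mutation sketch above.
def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real chat-completion call from your provider here.
    return "(LLM response would appear here)"

failed = [
    # Hypothetical failure record; use whatever structure your evaluation produces.
    {"question": "Which river runs through the capital in question?",
     "prediction": "unknown", "expected": "the Danube"},
]

improved = reflective_mutation(
    current_prompt="Given the fields question, summary 1, produce the fields query.",
    failed_examples=failed,
    reflection_llm=call_llm,
)
print(improved)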

2. Pareto Frontier Selection

GEPA maintains multiple high-performing candidates rather than just the single best:
Performance on different metrics:

Accuracy ↑
    |
    |    C (kept - strong on accuracy)
    |
    |  A (kept - balanced)    D (discarded - dominated)
    |
    |           B (kept - strong on helpfulness)
    |______________________________________→ Helpfulness
Candidates A, B, and C are all kept because each excels at something. Candidate D is discarded because another prompt dominates it on all metrics.
From the paper: “To avoid the local optima that afflict greedy prompt updates, GEPA maintains a Pareto front: instead of evolving only the global best prompt, it stochastically explores the top-performing prompts for each problem instance, thereby diversifying strategies and encouraging robust generalization.”
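To make the selection rule concrete, here is a minimal sketch of Pareto-frontier filtering over per-metric score vectors. It is illustrative only; the gepa package applies the same idea per problem instance rather than per aggregate metric.
# Keep a candidate unless another candidate is at least as good on every
# metric and strictly better on at least one.
def dominates(a, b):
    """True if score vector `a` dominates score vector `b`."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(candidates):
    """`candidates` maps name -> tuple of metric scores, e.g. (accuracy, helpfulness)."""
    frontier = {}
    for name, scores in candidates.items():
        if not any(dominates(other, scores) for other in candidates.values() if other != scores):
            frontier[name] = scores
    return frontier

# Matches the diagram above: A, B, C survive; D is dominated by A.
candidates = {"A": (0.6, 0.6), "B": (0.4, 0.9), "C": (0.9, 0.3), "D": (0.5, 0.5)}
print(pareto_frontier(candidates))  # {'A': ..., 'B': ..., 'C': ...}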

3. Genetic Evolution Loop

The complete GEPA algorithm (from Figure 3 in the paper), with a code sketch of the loop after the list:
  1. Initialize with seed prompts
  2. While budget remains:
    • Select a parent from Pareto frontier
    • Mutate via reflective LLM
    • Evaluate on minibatch
    • If improved: Add to pool, evaluate on full dataset
    • If not improved: Discard
  3. Return best candidate from final pool
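A compact sketch of that loop, reusing the reflective_mutation and pareto_frontier sketches from above. The evaluate helper and the random parent selection are stand-ins for the gepa package's internals, not its actual implementation.
import random

def gepa_loop(seed_prompt, trainset, evaluate, reflection_llm, budget=40, minibatch_size=5):
    """Illustrative GEPA-style loop; `evaluate(prompt, examples)` returns a score in [0, 1]."""
    pool = {seed_prompt: evaluate(seed_prompt, trainset)}  # candidate -> full-dataset score
    for _ in range(budget):
        # Select a parent from the pool (stand-in for GEPA's per-instance Pareto sampling).
        parent = random.choice(list(pool))
        # Mutate via LLM reflection on a small minibatch (ideally only the failed examples).
        minibatch = random.sample(trainset, k=min(minibatch_size, len(trainset)))
        child = reflective_mutation(parent, minibatch, reflection_llm)
        # Cheap minibatch check first; only promote candidates that improve.
        if evaluate(child, minibatch) > evaluate(parent, minibatch):
            pool[child] = evaluate(child, trainset)  # full evaluation before joining the pool
    return max(pool, key=pool.get)  # best candidate from the final pool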

Real Example from Research

The paper shows GEPA’s evolution for multi-hop QA (Figure 2).
Seed prompt (generic):
Given the fields question, summary 1, produce the fields query.
GEPA’s optimized prompt (after reflective evolution):
You will be given two input fields: question and summary 1.
Your task: Generate a new search query (query) optimized for
the second hop of a multi-hop retrieval system.

• The original user question is typically complex and requires
  information from multiple documents to answer.
• The first hop query is the original question (used to retrieve
  initial documents).
• Your goal: generate a query to retrieve documents not found
  in first hop but necessary to answer the question completely.

Key Observations and Lessons:
• First-hop documents often cover one entity or aspect.
• Remaining relevant documents often involve connected or
  higher-level concepts mentioned in summary 1 but not
  explicitly asked in the original question.
• The query should target these missing, but logically linked,
  documents.

[... detailed strategies continue ...]
This evolved prompt led to significant performance gains on the HotpotQA benchmark.

Integration with ATLAS

ATLAS uses GEPA to optimize teaching protocol templates:
# How ATLAS calls GEPA (from optimize_teaching.py)
import gepa
from trainers.prompt_adapter import ATLASGEPAAdapter

# Create ATLAS adapter (handles evaluation)
adapter = ATLASGEPAAdapter(
    teacher_model="gpt-5",
    student_model="gpt-4o-mini",
    generation_config={
        "max_tokens": 512,
        "diagnostic_max_tokens": 100,
        "temperature": 0.7
    }
)

# Run GEPA optimization
result = gepa.optimize(
    seed_candidate={
        "teacher_adaptive_template": "You are a teacher...",
        "student_diagnostic_template": "Show your thinking...",
        "student_with_teaching_template": "Apply guidance..."
    },
    trainset=your_examples,
    valset=validation_examples,
    adapter=adapter,  # ATLAS evaluates via full protocol
    reflection_lm=gpt4_reflection,  # LLM for mutations
    max_metric_calls=40,
    candidate_selection_strategy="pareto"
)

optimized_prompts = result.best_candidate
What each component does (an illustrative adapter sketch follows this list):
  • Adapter: Runs ATLAS protocol (baseline → teaching → enhanced) and scores with RIM
  • GEPA: Evolves prompts via reflection and Pareto selection
  • Reflection LLM: Analyzes failures and proposes improvements
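The division of labor can be pictured with a toy adapter like the one below. This is only a sketch of the baseline → teaching → enhanced flow; it is not the actual ATLASGEPAAdapter or the gepa adapter interface, and every name here (ToyTeachingAdapter, evaluate, rim_score) is a placeholder.
# Toy adapter illustrating the ATLAS evaluation flow (placeholder names throughout).
class ToyTeachingAdapter:
    def __init__(self, student_llm, teacher_llm, rim_score):
        self.student_llm = student_llm    # callable: prompt -> response text
        self.teacher_llm = teacher_llm    # callable: prompt -> guidance text
        self.rim_score = rim_score        # callable: (question, response) -> float reward

    def evaluate(self, candidate_prompts, example):
        """Score one example under the candidate teaching templates."""
        question = example["question"]
        # 1. Student baseline attempt (diagnostic).
        baseline = self.student_llm(
            candidate_prompts["student_diagnostic_template"] + "\n" + question)
        # 2. Teacher produces adaptive guidance from the diagnostic.
        guidance = self.teacher_llm(
            candidate_prompts["teacher_adaptive_template"]
            + f"\nQuestion: {question}\nStudent attempt: {baseline}")
        # 3. Student retries with the teaching applied.
        enhanced = self.student_llm(
            candidate_prompts["student_with_teaching_template"]
            + f"\n{guidance}\nQuestion: {question}")
        # 4. Reward model (RIM) scores the enhanced response; GEPA maximizes this score.
        return self.rim_score(question, enhanced)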

Research-Validated Performance

Sample Efficiency

From the paper’s experiments on Qwen3 8B:
| Task | GEPA Rollouts to Match GRPO | GRPO Rollouts | Efficiency Gain |
|---|---|---|---|
| HotpotQA | 402 | 24,000 | 60× |
| IFBench | 330 | 24,000 | 73× |
| HoVer | 1,179 | 24,000 | 20× |
| PUPA | 306 | 24,000 | 78× |
GEPA matches GRPO’s validation performance with 20-78× fewer rollouts.

Final Performance (Test Set)

Qwen3 8B Results (Table 1 from paper):
| Method | HotpotQA | IFBench | HoVer | PUPA | Aggregate |
|---|---|---|---|---|---|
| Baseline | 42.33 | 30.44 | 38.00 | 91.63 | 50.60 |
| MIPROv2 | 55.33 | 38.61 | 40.66 | 92.85 | 56.86 |
| GRPO (24k rollouts) | 43.33 | 52.93 | 38.00 | 89.50 | 55.94 |
| GEPA | 62.33 | 55.66 | 51.66 | 94.69 | 66.09 |
GEPA improvements:
  • vs Baseline: +15.49% aggregate
  • vs MIPROv2: +9.23% aggregate
  • vs GRPO: +10.15% aggregate

When to Use GEPA

Use GEPA when:
  • Budget is limited (~$10 for GEPA vs $100-1,000 for full training)
  • No GPU infrastructure available
  • Need results in hours, not days
  • Testing if ATLAS helps your domain
  • Rapid task-specific adaptation
Use GRPO (full training) when:
  • Maximum performance required
  • Have 4-8 GPUs available
  • Building production systems
  • Want teacher to learn new capabilities (not just prompt improvements)
Use both (hybrid approach):
  • Train teacher foundation with GRPO
  • Use GEPA for rapid domain adaptation
  • This combines deep learning with fast iteration

Cost & Time Estimates

From ATLAS optimization experience (~40 iterations, 10 examples):
Student baseline:   40 × 10 × gpt-4o-mini  ≈ $0.50
Teacher guidance:   40 × 10 × gpt-5        ≈ $4.00
Student enhanced:   40 × 10 × gpt-4o-mini  ≈ $0.50
Reflection LLM:     40 × gpt-4             ≈ $4.00
RIM scoring:        800 × Gemini Flash     ≈ $1.00
                                    Total  ≈ $10.00
Time: ~2 hours (depends on API latency and parallel workers)
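The call counts behind those estimates follow from the run shape; a quick back-of-the-envelope (prices omitted since they depend on your providers, and the ×2 for RIM scoring assumes both the baseline and the enhanced response are scored for each example):
# Back-of-the-envelope call counts for a run of this shape (assumptions noted above).
iterations = 40
examples = 10

student_baseline_calls = iterations * examples      # 400 gpt-4o-mini calls
teacher_guidance_calls = iterations * examples      # 400 teacher-model calls
student_enhanced_calls = iterations * examples      # 400 gpt-4o-mini calls
reflection_calls = iterations                       # 40 reflection-LLM calls
rim_scoring_calls = iterations * examples * 2       # 800 scores (baseline + enhanced)

print(student_baseline_calls, teacher_guidance_calls,
      student_enhanced_calls, reflection_calls, rim_scoring_calls)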

Performance Tips from Research

1. Provide Informative Seed Prompts: The paper shows GEPA benefits from reasonable starting points; bad seeds can converge slowly.
2. Use Representative Training Data: From the experiments, 10-150 well-chosen examples were sufficient. Quality > quantity.
3. Leverage Pareto Diversity: The paper demonstrates that Pareto selection prevents premature convergence. If all candidates look similar, try a higher reflection temperature.
4. Monitor the Evolution Trajectory: The paper shows most gains happen in the first 10-20 iterations. Check intermediate results to validate direction.

Comparison to Alternatives

vs DSPy / MIPROv2

From Table 1: GEPA outperforms MIPROv2 (leading prompt optimizer) by +9.23% aggregate on Qwen3 8B. Key difference: GEPA uses reflective mutation (LLM analyzes failures) vs MIPROv2’s Bayesian optimization (statistical search).

vs Reinforcement Learning (GRPO)

From paper results:
  • GRPO: Optimizes model weights, 24k rollouts, needs GPUs
  • GEPA: Optimizes prompts, 678-6,858 rollouts, API-only
GEPA achieves comparable or better performance with 35× fewer rollouts.

vs Manual Prompt Engineering

GEPA systematically explores prompt space using data-driven feedback, while manual engineering relies on intuition. Research shows GEPA finds non-obvious improvements humans miss.

Common Issues

Little or no improvement during optimization
From the paper: this can indicate the seed prompt is near-optimal or the task is too easy. Fix: try harder examples or a deliberately worse seed to see an optimization curve.
Candidates all look similar
Cause: reflection LLM temperature too low or insufficient diversity. Fix:
gepa_config:
  reflection_temperature: 0.9  # Higher exploration
Scores dip in intermediate generations
From the paper: this is normal exploration behavior. Pareto selection maintains the best candidates even if intermediate generations explore poor strategies. This is expected; the final result will still be good.

Citation

If you use GEPA in your research, please cite:
@article{agrawal2025gepa,
  title={GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning},
  author={Agrawal, Lakshya A and Tan, Shangyin and Soylu, Dilara and Ziems, Noah and Khare, Rishi and Opsahl-Ong, Krista and Singhvi, Arnav and Shandilya, Herumb and Ryan, Michael J and Jiang, Meng and Potts, Christopher and Sen, Koushik and Dimakis, Alexandros G and Stoica, Ion and Klein, Dan and Zaharia, Matei and Khattab, Omar},
  journal={arXiv preprint arXiv:2507.19457},
  year={2025}
}

References

  • Agrawal, L. A., Tan, S., Soylu, D., et al. (2025). GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. arXiv:2507.19457.
  • Opsahl-Ong, K., et al. (2024). MIPROv2: Multi-Prompt Optimization via Bayesian Search.
  • Shao, Z., et al. (2024). Group Relative Policy Optimization (GRPO).
  • Khattab, O., et al. (2024). DSPy: Programming—not prompting—Foundation Models.