Overview

GEPA (Genetic-Pareto) is an external optimization package that ATLAS uses to evolve teaching prompts through LLM-based reflection and Pareto-efficient evolutionary search. Published research shows GEPA achieves substantial performance gains with remarkable sample efficiency.
  • Package: gepa v0.0.12
  • Source: https://github.com/gepa-ai/gepa
  • Paper: GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning (Agrawal et al., 2025)
  • License: MIT

Why GEPA?

Traditional approaches to optimizing AI systems face trade-offs between speed, cost, and infrastructure requirements. GEPA offers a middle ground:
| Approach | Rollouts Required | Infrastructure | Key Benefit |
|---|---|---|---|
| Manual prompt engineering | N/A | None | Zero cost |
| GEPA | 678-6,858 | None (API only) | Fast + sample-efficient |
| GRPO (RL training) | 24,000 | 4-8 GPUs + LoRA | Maximum performance |
Official benchmark results (Agrawal et al., 2025, Table 1), on Qwen3 8B across four tasks:
  • HotpotQA: 62.33% (GEPA) vs 43.33% (GRPO) = +19.0% improvement
  • IFBench: 55.66% (GEPA) vs 52.93% (GRPO) = +2.73% improvement
  • HoVer: 51.66% (GEPA) vs 38.00% (GRPO) = +13.66% improvement
  • PUPA: 94.69% (GEPA) vs 89.50% (GRPO) = +5.19% improvement
Average improvement: +10% over GRPO while using up to 35× fewer rollouts. GEPA also outperformed MIPROv2 (a state-of-the-art prompt optimizer) by +10.3% aggregate across all benchmarks.

How It Works

GEPA combines three core mechanisms from the research paper:

1. Reflective Mutation

Instead of random changes, GEPA uses an LLM to analyze failures and propose targeted improvements:
# Conceptual flow from the paper (illustrative, not the gepa package's API)
def reflective_mutation(current_prompt, failed_examples, reflection_llm):
    """Use LLM reflection to intelligently improve prompts.

    `reflection_llm` is any callable that takes a prompt string and
    returns the model's text response.
    """

    # Step 1: Analyze what went wrong
    reflection = reflection_llm(f"""
    Current prompt: {current_prompt}

    Failed on these examples: {failed_examples}

    Analyze: What specific issues caused these failures?
    What patterns do you see? What should change?
    """)

    # Step 2: Generate improved variant
    new_prompt = reflection_llm(f"""
    Original prompt: {current_prompt}
    Analysis: {reflection}

    Generate an improved prompt addressing these issues.
    """)

    return new_prompt
From the paper: “GEPA iteratively mutates every prompt within the AI system in light of natural language feedback drawn from new rollouts. In each mutation, the candidate prompt is derived from an ancestor, accumulating high-level lessons derived from observations and LLM feedback.”
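To show how the sketch above might be driven, here is a toy invocation. The `call_llm` helper, the example data, and the seed prompt usage are placeholders for illustration only; they are not part of the gepa package's API.
# Toy driver for the reflective_mutation sketch above.
def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real chat-completion call from your provider here.
    return "(LLM response would appear here)"

failed = [
    # Hypothetical failure record; use whatever structure your evaluation produces.
    {"question": "Which river runs through the capital in question?",
     "prediction": "unknown", "expected": "the Danube"},
]

improved = reflective_mutation(
    current_prompt="Given the fields question, summary 1, produce the fields query.",
    failed_examples=failed,
    reflection_llm=call_llm,
)
print(improved)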

2. Pareto Frontier Selection

GEPA maintains multiple high-performing candidates rather than just the single best:
Performance on different metrics:

Accuracy ↑
    |
    |    C (kept - strong on accuracy)
    |
    |  A (kept - balanced)    D (discarded - dominated)
    |
    |           B (kept - strong on helpfulness)
    |______________________________________→ Helpfulness
Candidates A, B, and C are all kept because each excels at something. Candidate D is discarded because another prompt dominates it on all metrics.
From the paper: “To avoid the local optima that afflict greedy prompt updates, GEPA maintains a Pareto front: instead of evolving only the global best prompt, it stochastically explores the top-performing prompts for each problem instance, thereby diversifying strategies and encouraging robust generalization.”
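To make the selection rule concrete, here is a minimal sketch of Pareto-frontier filtering over per-metric score vectors. It is illustrative only; the gepa package applies the same idea per problem instance rather than per aggregate metric.
# Keep a candidate unless another candidate is at least as good on every
# metric and strictly better on at least one.
def dominates(a, b):
    """True if score vector `a` dominates score vector `b`."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(candidates):
    """`candidates` maps name -> tuple of metric scores, e.g. (accuracy, helpfulness)."""
    frontier = {}
    for name, scores in candidates.items():
        if not any(dominates(other, scores) for other in candidates.values() if other != scores):
            frontier[name] = scores
    return frontier

# Matches the diagram above: A, B, C survive; D is dominated by A.
candidates = {"A": (0.6, 0.6), "B": (0.4, 0.9), "C": (0.9, 0.3), "D": (0.5, 0.5)}
print(pareto_frontier(candidates))  # {'A': ..., 'B': ..., 'C': ...}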

3. Genetic Evolution Loop

The complete GEPA algorithm (from Figure 3 in the paper), with a code sketch of the loop after the list:
  1. Initialize with seed prompts
  2. While budget remains:
    • Select a parent from Pareto frontier
    • Mutate via reflective LLM
    • Evaluate on minibatch
    • If improved: Add to pool, evaluate on full dataset
    • If not improved: Discard
  3. Return best candidate from final pool
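A compact sketch of that loop, reusing the reflective_mutation and pareto_frontier sketches from above. The evaluate helper and the random parent selection are stand-ins for the gepa package's internals, not its actual implementation.
import random

def gepa_loop(seed_prompt, trainset, evaluate, reflection_llm, budget=40, minibatch_size=5):
    """Illustrative GEPA-style loop; `evaluate(prompt, examples)` returns a score in [0, 1]."""
    pool = {seed_prompt: evaluate(seed_prompt, trainset)}  # candidate -> full-dataset score
    for _ in range(budget):
        # Select a parent from the pool (stand-in for GEPA's per-instance Pareto sampling).
        parent = random.choice(list(pool))
        # Mutate via LLM reflection on a small minibatch (ideally only the failed examples).
        minibatch = random.sample(trainset, k=min(minibatch_size, len(trainset)))
        child = reflective_mutation(parent, minibatch, reflection_llm)
        # Cheap minibatch check first; only promote candidates that improve.
        if evaluate(child, minibatch) > evaluate(parent, minibatch):
            pool[child] = evaluate(child, trainset)  # full evaluation before joining the pool
    return max(pool, key=pool.get)  # best candidate from the final pool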

Real Example from Research

The paper shows GEPA’s evolution for multi-hop QA (Figure 2).
Seed prompt (generic):
Given the fields question, summary 1, produce the fields query.
GEPA’s optimized prompt (after reflective evolution):
You will be given two input fields: question and summary 1.
Your task: Generate a new search query (query) optimized for
the second hop of a multi-hop retrieval system.

• The original user question is typically complex and requires
  information from multiple documents to answer.
• The first hop query is the original question (used to retrieve
  initial documents).
• Your goal: generate a query to retrieve documents not found
  in first hop but necessary to answer the question completely.

Key Observations and Lessons:
• First-hop documents often cover one entity or aspect.
• Remaining relevant documents often involve connected or
  higher-level concepts mentioned in summary 1 but not
  explicitly asked in the original question.
• The query should target these missing, but logically linked,
  documents.

[... detailed strategies continue ...]
This evolved prompt led to significant performance gains on the HotpotQA benchmark.

Integration with ATLAS

ATLAS uses GEPA to optimize teaching protocol templates:
# How ATLAS calls GEPA (from optimize_teaching.py)
import gepa
from trainers.prompt_adapter import ATLASGEPAAdapter

# Create ATLAS adapter (handles evaluation)
adapter = ATLASGEPAAdapter(
    teacher_model="gpt-5",
    student_model="gpt-4o-mini",
    generation_config={
        "max_tokens": 512,
        "diagnostic_max_tokens": 100,
        "temperature": 0.7
    }
)

# Run GEPA optimization
result = gepa.optimize(
    seed_candidate={
        "teacher_adaptive_template": "You are a teacher...",
        "student_diagnostic_template": "Show your thinking...",
        "student_with_teaching_template": "Apply guidance..."
    },
    trainset=your_examples,
    valset=validation_examples,
    adapter=adapter,  # ATLAS evaluates via full protocol
    reflection_lm=gpt4_reflection,  # LLM for mutations
    max_metric_calls=40,
    candidate_selection_strategy="pareto"
)

optimized_prompts = result.best_candidate
What each component does (an illustrative adapter sketch follows this list):
  • Adapter: Runs ATLAS protocol (baseline → teaching → enhanced) and scores with RIM
  • GEPA: Evolves prompts via reflection and Pareto selection
  • Reflection LLM: Analyzes failures and proposes improvements
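The division of labor can be pictured with a toy adapter like the one below. This is only a sketch of the baseline → teaching → enhanced flow; it is not the actual ATLASGEPAAdapter or the gepa adapter interface, and every name here (ToyTeachingAdapter, evaluate, rim_score) is a placeholder.
# Toy adapter illustrating the ATLAS evaluation flow (placeholder names throughout).
class ToyTeachingAdapter:
    def __init__(self, student_llm, teacher_llm, rim_score):
        self.student_llm = student_llm    # callable: prompt -> response text
        self.teacher_llm = teacher_llm    # callable: prompt -> guidance text
        self.rim_score = rim_score        # callable: (question, response) -> float reward

    def evaluate(self, candidate_prompts, example):
        """Score one example under the candidate teaching templates."""
        question = example["question"]
        # 1. Student baseline attempt (diagnostic).
        baseline = self.student_llm(
            candidate_prompts["student_diagnostic_template"] + "\n" + question)
        # 2. Teacher produces adaptive guidance from the diagnostic.
        guidance = self.teacher_llm(
            candidate_prompts["teacher_adaptive_template"]
            + f"\nQuestion: {question}\nStudent attempt: {baseline}")
        # 3. Student retries with the teaching applied.
        enhanced = self.student_llm(
            candidate_prompts["student_with_teaching_template"]
            + f"\n{guidance}\nQuestion: {question}")
        # 4. Reward model (RIM) scores the enhanced response; GEPA maximizes this score.
        return self.rim_score(question, enhanced)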

Research-Validated Performance

Sample Efficiency

From the paper’s experiments on Qwen3 8B:
| Task | GEPA Rollouts to Match GRPO | GRPO Rollouts | Efficiency Gain |
|---|---|---|---|
| HotpotQA | 402 | 24,000 | 60× |
| IFBench | 330 | 24,000 | 73× |
| HoVer | 1,179 | 24,000 | 20× |
| PUPA | 306 | 24,000 | 78× |
GEPA matches GRPO’s validation performance with 20-78× fewer rollouts.

Final Performance (Test Set)

Qwen3 8B Results (Table 1 from paper):
| Method | HotpotQA | IFBench | HoVer | PUPA | Aggregate |
|---|---|---|---|---|---|
| Baseline | 42.33 | 30.44 | 38.00 | 91.63 | 50.60 |
| MIPROv2 | 55.33 | 38.61 | 40.66 | 92.85 | 56.86 |
| GRPO (24k rollouts) | 43.33 | 52.93 | 38.00 | 89.50 | 55.94 |
| GEPA | 62.33 | 55.66 | 51.66 | 94.69 | 66.09 |
GEPA improvements:
  • vs Baseline: +15.49% aggregate
  • vs MIPROv2: +9.23% aggregate
  • vs GRPO: +10.15% aggregate

When to Use GEPA

Use GEPA when:
  • Budget is limited (~$10 for GEPA vs $100-1,000 for full training)
  • No GPU infrastructure available
  • Need results in hours, not days
  • Testing if ATLAS helps your domain
  • Rapid task-specific adaptation
Use GRPO (full training) when:
  • Maximum performance required
  • Have 4-8 GPUs available
  • Building production systems
  • Want teacher to learn new capabilities (not just prompt improvements)
Use both (hybrid approach):
  • Train teacher foundation with GRPO
  • Use GEPA for rapid domain adaptation
  • This combines deep learning with fast iteration

Cost & Time Estimates

From ATLAS optimization experience (~40 iterations, 10 examples):
Student baseline:   40 × 10 × gpt-4o-mini  ≈ $0.50
Teacher guidance:   40 × 10 × gpt-5        ≈ $4.00
Student enhanced:   40 × 10 × gpt-4o-mini  ≈ $0.50
Reflection LLM:     40 × gpt-4             ≈ $4.00
RIM scoring:        800 × Gemini Flash     ≈ $1.00
                                    Total  ≈ $10.00
Time: ~2 hours (depends on API latency and parallel workers)
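The call counts behind those estimates follow from the run shape; a quick back-of-the-envelope (prices omitted since they depend on your providers, and the ×2 for RIM scoring assumes both the baseline and the enhanced response are scored for each example):
# Back-of-the-envelope call counts for a run of this shape (assumptions noted above).
iterations = 40
examples = 10

student_baseline_calls = iterations * examples      # 400 gpt-4o-mini calls
teacher_guidance_calls = iterations * examples      # 400 teacher-model calls
student_enhanced_calls = iterations * examples      # 400 gpt-4o-mini calls
reflection_calls = iterations                       # 40 reflection-LLM calls
rim_scoring_calls = iterations * examples * 2       # 800 scores (baseline + enhanced)

print(student_baseline_calls, teacher_guidance_calls,
      student_enhanced_calls, reflection_calls, rim_scoring_calls)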

Performance Tips from Research

1. Provide Informative Seed Prompts: The paper shows GEPA benefits from reasonable starting points; bad seeds can converge slowly.
2. Use Representative Training Data: From the experiments, 10-150 well-chosen examples were sufficient. Quality > quantity.
3. Leverage Pareto Diversity: The paper demonstrates that Pareto selection prevents premature convergence. If all candidates look similar, try a higher reflection temperature.
4. Monitor the Evolution Trajectory: The paper shows most gains happen in the first 10-20 iterations. Check intermediate results to validate direction.

Comparison to Alternatives

vs DSPy / MIPROv2

From Table 1: GEPA outperforms MIPROv2 (leading prompt optimizer) by +9.23% aggregate on Qwen3 8B. Key difference: GEPA uses reflective mutation (LLM analyzes failures) vs MIPROv2’s Bayesian optimization (statistical search).

vs Reinforcement Learning (GRPO)

From paper results:
  • GRPO: Optimizes model weights, 24k rollouts, needs GPUs
  • GEPA: Optimizes prompts, 678-6,858 rollouts, API-only
GEPA achieves comparable or better performance with 35× fewer rollouts.

vs Manual Prompt Engineering

GEPA systematically explores prompt space using data-driven feedback, while manual engineering relies on intuition. Research shows GEPA finds non-obvious improvements humans miss.

Common Issues

Little or no improvement during optimization
From the paper: this can indicate the seed prompt is near-optimal or the task is too easy. Fix: try harder examples or a deliberately worse seed to see an optimization curve.
Candidates all look similar
Cause: reflection LLM temperature too low or insufficient diversity. Fix:
gepa_config:
  reflection_temperature: 0.9  # Higher exploration
Scores dip in intermediate generations
From the paper: this is normal exploration behavior. Pareto selection maintains the best candidates even if intermediate generations explore poor strategies. This is expected; the final result will still be good.

Citation

If you use GEPA in your research, please cite:
@article{agrawal2025gepa,
  title={GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning},
  author={Agrawal, Lakshya A and Tan, Shangyin and Soylu, Dilara and Ziems, Noah and Khare, Rishi and Opsahl-Ong, Krista and Singhvi, Arnav and Shandilya, Herumb and Ryan, Michael J and Jiang, Meng and Potts, Christopher and Sen, Koushik and Dimakis, Alexandros G and Stoica, Ion and Klein, Dan and Zaharia, Matei and Khattab, Omar},
  journal={arXiv preprint arXiv:2507.19457},
  year={2025}
}

References

  • Agrawal, L. A., Tan, S., Soylu, D., et al. (2025). GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. arXiv:2507.19457.
  • Opsahl-Ong, K., et al. (2024). MIPROv2: Multi-Prompt Optimization via Bayesian Search.
  • Shao, Z., et al. (2024). Group Relative Policy Optimization (GRPO).
  • Khattab, O., et al. (2024). DSPy: Programming—not prompting—Foundation Models.