Overview
GEPA (Genetic-Pareto) is an external optimization package that ATLAS uses to evolve teaching prompts through LLM-based reflection and Pareto-efficient evolutionary search. Published research shows GEPA achieves substantial performance gains with remarkable sample efficiency.

Package: gepa (v0.0.12)
Source: https://github.com/gepa-ai/gepa
Paper: GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning (Agrawal et al., 2025)
License: MIT

Why GEPA?
Traditional approaches to optimizing AI systems face trade-offs between speed, cost, and infrastructure requirements. GEPA offers a middle ground:

Approach | Rollouts Required | Infrastructure | Key Benefit |
---|---|---|---|
Manual prompt engineering | N/A | None | Zero cost |
GEPA | 678-6,858 | None (API only) | Fast + sample-efficient |
GRPO (RL training) | 24,000 | 4-8 GPUs + LoRA | Maximum performance |
GEPA also outperforms GRPO on final test accuracy (Qwen3 8B):
- HotpotQA: 62.33% (GEPA) vs 43.33% (GRPO) = +19.00% improvement
- IFBench: 55.66% (GEPA) vs 52.93% (GRPO) = +2.73% improvement
- HoVer: 51.66% (GEPA) vs 38.00% (GRPO) = +13.66% improvement
- PUPA: 94.69% (GEPA) vs 89.50% (GRPO) = +5.19% improvement
How It Works
GEPA combines three core mechanisms from the research paper.

1. Reflective Mutation
Instead of random changes, GEPA uses an LLM to analyze failures and propose targeted improvements, as in the sketch below.
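A minimal sketch of the reflective-mutation step, assuming a generic chat-completion callable `reflection_llm(text) -> str`; the function name and the reflection prompt wording are illustrative, not the gepa package's internals.

```python
def reflect_and_mutate(current_prompt, failure_traces, reflection_llm):
    """Ask an LLM to diagnose failures and rewrite the prompt accordingly."""
    reflection_request = (
        "You are improving an instruction prompt for an AI system.\n\n"
        f"Current prompt:\n{current_prompt}\n\n"
        "Examples where the system failed, with its outputs and feedback:\n"
        + "\n---\n".join(failure_traces)
        + "\n\nDiagnose what the prompt is missing and write an improved prompt "
          "that addresses these failures. Return only the new prompt."
    )
    return reflection_llm(reflection_request)
```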
2. Pareto Frontier Selection

GEPA maintains multiple high-performing candidates rather than just the single best, so a prompt that excels on even a few training examples stays in the pool.

3. Genetic Evolution Loop
The complete GEPA algorithm (from Figure 3 in the paper) proceeds as follows; a code sketch follows the list.
- Initialize with seed prompts
- While budget remains:
  - Select a parent from the Pareto frontier
  - Mutate via reflective LLM
  - Evaluate on a minibatch
  - If improved: add to pool, evaluate on full dataset
  - If not improved: discard
- Return the best candidate from the final pool
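A minimal sketch of this loop in Python, assuming caller-supplied callables `evaluate(prompt, examples) -> list[float]` and `reflect_and_mutate(prompt, scores, examples) -> str` (the latter could wrap the reflective-mutation sketch above); the names and structure are illustrative, not the gepa package's implementation.

```python
import random

def gepa_loop(seed_prompt, trainset, evaluate, reflect_and_mutate,
              budget=1000, minibatch_size=4):
    # pool maps each candidate prompt to its per-example scores on the full trainset.
    pool = {seed_prompt: evaluate(seed_prompt, trainset)}
    rollouts = len(trainset)

    while rollouts < budget:
        # Pareto frontier: keep any candidate that is best (or tied) on at
        # least one training example, then sample a parent from it.
        best_per_example = [max(scores[i] for scores in pool.values())
                            for i in range(len(trainset))]
        frontier = [p for p, scores in pool.items()
                    if any(s >= b for s, b in zip(scores, best_per_example))]
        parent = random.choice(frontier)

        # Reflective mutation: an LLM inspects the parent's weak examples and
        # proposes a targeted rewrite.
        child = reflect_and_mutate(parent, pool[parent], trainset)

        # Cheap screen: compare parent and child on a small minibatch.
        idx = random.sample(range(len(trainset)), minibatch_size)
        child_scores = evaluate(child, [trainset[i] for i in idx])
        rollouts += minibatch_size

        if sum(child_scores) > sum(pool[parent][i] for i in idx):
            # Promising child: pay for a full-dataset evaluation and keep it.
            pool[child] = evaluate(child, trainset)
            rollouts += len(trainset)
        # Otherwise the child is discarded.

    # Return the candidate with the best mean score on the full trainset.
    return max(pool, key=lambda p: sum(pool[p]) / len(pool[p]))
```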
Real Example from Research
The paper shows GEPA's evolution for multi-hop QA (Figure 2): a generic seed prompt is progressively rewritten into a detailed, task-specific instruction. See Figure 2 in the paper for the full prompt text.

Integration with ATLAS
ATLAS uses GEPA to optimize teaching protocol templates:
- Adapter: Runs the ATLAS protocol (baseline → teaching → enhanced) and scores with RIM (sketched after this list)
- GEPA: Evolves prompts via reflection and Pareto selection
- Reflection LLM: Analyzes failures and proposes improvements
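A hypothetical sketch of the adapter's role, under the assumption that GEPA only needs a per-candidate scoring callback; the class and callable names (`AtlasGepaAdapter`, `student`, `teacher`, `rim_judge`) are illustrative, not ATLAS's or gepa's actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AtlasGepaAdapter:
    student: Callable    # student(question, guidance=None) -> answer
    teacher: Callable    # teacher(teaching_template, question) -> guidance
    rim_judge: Callable  # rim_judge(example, baseline, enhanced) -> float

    def evaluate(self, teaching_template: str, examples: list) -> list:
        """Score one candidate teaching template over a set of examples."""
        scores = []
        for ex in examples:
            # 1. Baseline: the student answers on its own.
            baseline = self.student(ex["question"])
            # 2. Teaching: the teacher applies the candidate template.
            guidance = self.teacher(teaching_template, ex["question"])
            # 3. Enhanced: the student answers again with the teacher's guidance.
            enhanced = self.student(ex["question"], guidance=guidance)
            # RIM scores how much the teaching improved the answer.
            scores.append(self.rim_judge(ex, baseline, enhanced))
        return scores
```

An adapter like this could supply the `evaluate` callable used in the loop sketch above, so GEPA evolves the teaching template while the adapter handles protocol execution and scoring.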
Research-Validated Performance
Sample Efficiency
From the paper's experiments on Qwen3 8B:

Task | GEPA Rollouts to Match GRPO | GRPO Rollouts | Efficiency Gain |
---|---|---|---|
HotpotQA | 402 | 24,000 | 60× |
IFBench | 330 | 24,000 | 73× |
HoVer | 1,179 | 24,000 | 20× |
PUPA | 306 | 24,000 | 78× |
Final Performance (Test Set)
Qwen3 8B Results (Table 1 from paper):

Method | HotpotQA | IFBench | HoVer | PUPA | Aggregate |
---|---|---|---|---|---|
Baseline | 42.33 | 30.44 | 38.00 | 91.63 | 50.60 |
MIPROv2 | 55.33 | 38.61 | 40.66 | 92.85 | 56.86 |
GRPO (24k rollouts) | 43.33 | 52.93 | 38.00 | 89.50 | 55.94 |
GEPA | 62.33 | 55.66 | 51.66 | 94.69 | 66.09 |
GEPA improvements:
- vs Baseline: +15.49% aggregate
- vs MIPROv2: +9.23% aggregate
- vs GRPO: +10.15% aggregate
When to Use GEPA
Use GEPA when:
- Budget is limited (~100-1000 for full training)
- No GPU infrastructure available
- Need results in hours, not days
- Testing if ATLAS helps your domain
- Rapid task-specific adaptation
Use GRPO (RL training) when:
- Maximum performance is required
- Have 4-8 GPUs available
- Building production systems
- Want teacher to learn new capabilities (not just prompt improvements)
Hybrid approach:
- Train the teacher foundation with GRPO
- Use GEPA for rapid domain adaptation
- This combines deep learning with fast iteration
Cost & Time Estimates
From ATLAS optimization experience, a typical run uses ~40 iterations over 10 examples.

Performance Tips from Research
1. Provide informative seed prompts. The paper shows GEPA benefits from reasonable starting points; bad seeds can converge slowly.
2. Use representative training data. From the experiments, 10-150 well-chosen examples were sufficient. Quality > quantity.
3. Leverage Pareto diversity. The paper demonstrates that Pareto selection prevents premature convergence. If all candidates look similar, try a higher reflection temperature.
4. Monitor the evolution trajectory. The paper shows most gains happen in the first 10-20 iterations. Check intermediate results to validate direction (see the sketch below).
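A small illustration of tip 4, assuming the optimization loop is extended to record the candidate pool after each iteration (as the `gepa_loop` sketch above easily could be); the helper name is hypothetical.

```python
def print_trajectory(pool_history):
    """Print the best mean score after each iteration to spot early plateaus."""
    for step, pool in enumerate(pool_history):
        best = max(sum(scores) / len(scores) for scores in pool.values())
        print(f"iteration {step:3d}  best mean score: {best:.3f}")
```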
Comparison to Alternatives

vs DSPy / MIPROv2
From Table 1: GEPA outperforms MIPROv2 (a leading prompt optimizer) by +9.23% aggregate on Qwen3 8B. Key difference: GEPA uses reflective mutation (an LLM analyzes failures), while MIPROv2 relies on Bayesian optimization (statistical search).

vs Reinforcement Learning (GRPO)
From paper results:
- GRPO: Optimizes model weights, 24k rollouts, needs GPUs
- GEPA: Optimizes prompts, 678-6,858 rollouts, API-only
vs Manual Prompt Engineering
GEPA systematically explores the prompt space using data-driven feedback, while manual engineering relies on intuition. The research shows GEPA finds non-obvious improvements that humans miss.

Common Issues
Scores plateau early
From the paper: this can indicate the seed prompt is near-optimal or the task is too easy. Fix: try harder examples or a deliberately worse seed to see the optimization curve.
All prompts converge to similar text
Cause: reflection LLM temperature is too low, or there is insufficient diversity. Fix: increase the reflection temperature (see performance tip 3 above).
Performance oscillates
From the paper: this is normal exploration behavior. Pareto selection maintains the best candidates even if intermediate generations explore poor strategies. This is expected; the final result will still be good.
Citation
If you use GEPA in your research, please cite Agrawal et al. (2025), GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning, arXiv:2507.19457 (see References below).

Next Steps
- Run GEPA Optimization: Try GEPA on your own data
- Read the Paper: Full technical details and proofs
- GEPA Package: Source code and documentation
- Full Training Guide: When you're ready for GRPO
References
- Agrawal, L. A., Tan, S., Soylu, D., et al. (2025). GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. arXiv:2507.19457.
- Opsahl-Ong, K., et al. (2024). MIPROv2: Multi-Prompt Optimization via Bayesian Search.
- Shao, Z., et al. (2024). Group Relative Policy Optimization (GRPO).
- Khattab, O., et al. (2024). DSPy: Programming—not prompting—Foundation Models.