🤖 AI Summary
Existing prompt evolution methods primarily optimize individual prompts in isolation, neglecting the potential synergistic gains from collaborative multi-prompt ensembles.
Method: We propose C-Evolve, a consensus-driven prompt group evolution framework that employs majority voting over model outputs as the fitness function to guide population-based evolution toward highly consistent and high-performing prompt sets. To preserve diversity and overcome stagnation inherent in single-prompt optimization, C-Evolve adopts an island-model evolutionary architecture with inter-island migration.
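The two core mechanisms named above — majority voting over model outputs and inter-island migration — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names, prompt representation, and the ring-migration policy are all assumptions.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the most common answer among a group's outputs.

    Ties are broken by first occurrence, an assumed tie-breaking rule.
    """
    return Counter(outputs).most_common(1)[0][0]

def migrate(islands, k=1, key=lambda p: p["fitness"]):
    """Assumed ring-migration policy: copy each island's top-k prompts
    to the next island to reintroduce diversity and avoid stagnation."""
    tops = [sorted(isl, key=key, reverse=True)[:k] for isl in islands]
    for i, top in enumerate(tops):
        islands[(i + 1) % len(islands)].extend(dict(p) for p in top)
    return islands
```

With this shape, `majority_vote(["A", "B", "A"])` yields `"A"`, and after one `migrate` call each island also holds copies of its neighbor's best prompts.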
Contribution/Results: Evaluated on HotpotQA, IFBench, and MATH, C-Evolve achieves significant improvements over baselines (e.g., GEPA) on both Qwen3-8B and GPT-4.1-mini. It is the first work to empirically demonstrate that “prompt ensemble consensus” can systematically enhance the reasoning capabilities of closed-source LMs. This establishes a novel paradigm for prompt engineering—shifting focus from isolated prompt tuning to cooperative, population-level prompt optimization.
📝 Abstract
Prompt evolution algorithms offer a powerful paradigm for enhancing AI systems built on closed-source models, yet little work explores whether aggregating the results of multiple prompts to reach a consensus can further push the capability boundary of such systems. In this paper, we introduce Consensus-Evolve (C-Evolve), an evolutionary algorithm that discovers a group of prompts whose aggregated outputs, after majority voting, achieve optimal performance. More specifically, C-Evolve employs an island-based evolutionary algorithm to maintain population diversity, and prompts from distinct islands are selected to form groups whose outputs are aggregated. The key difference from single-individual evolution is a voting score, which evaluates each prompt's contribution within the groups it joins; we use this, rather than individual performance, as the fitness score for evolution. Consequently, C-Evolve is more likely to produce and retain prompts with high potential to form a high-performing group, and to eliminate low-performing ones, gradually improving group performance after consensus is reached. Our method achieves state-of-the-art performance across a wide range of tasks, including open-ended tasks such as HotpotQA and closed-ended tasks such as MATH. On Qwen3-8B, C-Evolve achieves 70.67% on HotpotQA and 43.88% on IFBench, which are 4.95% and 2.73% higher than GEPA, respectively. For GPT-4.1-mini, accuracy on IFBench further improves to 47.96%, and reaches 95.33% on the MATH benchmark. These results demonstrate C-Evolve's competitive performance.
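One way to read the voting score described above: a prompt's fitness is the average accuracy of majority-voted groups that contain it. The sketch below is an assumed estimator under that reading; the group-sampling scheme, function names, and the `answer_fn` interface are illustrative, not taken from the paper.

```python
import random
from collections import Counter

def majority_vote(answers):
    """Most common answer in a group's outputs (assumed tie-break: first seen)."""
    return Counter(answers).most_common(1)[0][0]

def voting_score(prompt, other_pools, answer_fn, examples, n_groups=20, rng=random):
    """Estimate a prompt's contribution as the mean accuracy of sampled
    majority-voting groups that include it (hypothetical estimator).

    other_pools: candidate prompts from the other islands.
    answer_fn(p, x): the model's answer to input x under prompt p.
    examples: (input, gold answer) pairs used for evaluation.
    """
    total = 0.0
    for _ in range(n_groups):
        # Assumed grouping: this prompt plus one prompt drawn from each other island.
        group = [prompt] + [rng.choice(pool) for pool in other_pools]
        correct = sum(
            majority_vote([answer_fn(p, x) for p in group]) == y
            for x, y in examples
        )
        total += correct / len(examples)
    return total / n_groups
```

Using this score as fitness (in place of a prompt's standalone accuracy) is what biases selection toward prompts that combine well with others rather than prompts that merely perform well alone.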