🤖 AI Summary
Existing prompt evolution methods primarily optimize individual prompts in isolation, neglecting the potential synergistic gains from collaborative multi-prompt ensembles.
Method: We propose C-Evolve, a consensus-driven prompt group evolution framework that employs majority voting over model outputs as the fitness function to guide population-based evolution toward highly consistent and high-performing prompt sets. To preserve diversity and overcome stagnation inherent in single-prompt optimization, C-Evolve adopts an island-model evolutionary architecture with inter-island migration.
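The two core mechanisms named above — majority voting over model outputs and inter-island migration — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names, prompt representation, and the ring-migration policy are all assumptions.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the most common answer among a group's outputs.

    Ties are broken by first occurrence, an assumed tie-breaking rule.
    """
    return Counter(outputs).most_common(1)[0][0]

def migrate(islands, k=1, key=lambda p: p["fitness"]):
    """Assumed ring-migration policy: copy each island's top-k prompts
    to the next island to reintroduce diversity and avoid stagnation."""
    tops = [sorted(isl, key=key, reverse=True)[:k] for isl in islands]
    for i, top in enumerate(tops):
        islands[(i + 1) % len(islands)].extend(dict(p) for p in top)
    return islands
```

With this shape, `majority_vote(["A", "B", "A"])` yields `"A"`, and after one `migrate` call each island also holds copies of its neighbor's best prompts.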
Contribution/Results: Evaluated on HotpotQA, IFBench, and MATH, C-Evolve achieves significant improvements over baselines (e.g., GEPA) on both Qwen3-8B and GPT-4.1-mini. It is the first work to empirically demonstrate that “prompt ensemble consensus” can systematically enhance the reasoning capabilities of closed-source LMs. This establishes a novel paradigm for prompt engineering—shifting focus from isolated prompt tuning to cooperative, population-level prompt optimization.
📝 Abstract
Prompt evolution algorithms offer a powerful paradigm for enhancing AI systems built on closed-source models, yet little work explores whether aggregating the results of multiple prompts to reach a consensus can further push the capability boundary of such systems. In this paper, we introduce Consensus-Evolve (C-Evolve), an evolutionary algorithm that discovers a group of prompts whose aggregated outputs, after majority voting, achieve optimal performance. More specifically, C-Evolve employs an island-based evolutionary algorithm to maintain population diversity, and prompts from distinct islands are selected to form groups whose outputs are aggregated. The key difference from single-individual evolution is a voting score, which evaluates each prompt's contribution within the groups it joins; we use this, rather than individual performance, as the fitness score for evolution. Consequently, C-Evolve is more likely to produce and retain prompts with high potential to form a high-performing group, and to eliminate low-performing ones, gradually improving group performance after consensus is reached. Our method achieves state-of-the-art performance across a wide range of tasks, including open-ended tasks such as HotpotQA and closed-ended tasks such as MATH. On Qwen3-8B, C-Evolve achieves 70.67% on HotpotQA and 43.88% on IFBench, which are 4.95% and 2.73% higher than GEPA, respectively. For GPT-4.1-mini, accuracy on IFBench further improves to 47.96%, and reaches 95.33% on the MATH benchmark. These results demonstrate C-Evolve's competitive performance.
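One way to read the voting score described above: a prompt's fitness is the average accuracy of majority-voted groups that contain it. The sketch below is an assumed estimator under that reading; the group-sampling scheme, function names, and the `answer_fn` interface are illustrative, not taken from the paper.

```python
import random
from collections import Counter

def majority_vote(answers):
    """Most common answer in a group's outputs (assumed tie-break: first seen)."""
    return Counter(answers).most_common(1)[0][0]

def voting_score(prompt, other_pools, answer_fn, examples, n_groups=20, rng=random):
    """Estimate a prompt's contribution as the mean accuracy of sampled
    majority-voting groups that include it (hypothetical estimator).

    other_pools: candidate prompts from the other islands.
    answer_fn(p, x): the model's answer to input x under prompt p.
    examples: (input, gold answer) pairs used for evaluation.
    """
    total = 0.0
    for _ in range(n_groups):
        # Assumed grouping: this prompt plus one prompt drawn from each other island.
        group = [prompt] + [rng.choice(pool) for pool in other_pools]
        correct = sum(
            majority_vote([answer_fn(p, x) for p in group]) == y
            for x, y in examples
        )
        total += correct / len(examples)
    return total / n_groups
```

Using this score as fitness (in place of a prompt's standalone accuracy) is what biases selection toward prompts that combine well with others rather than prompts that merely perform well alone.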