Tournament of Prompts: Evolving LLM Instructions Through Structured Debates and Elo Ratings

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
For complex prompt engineering tasks that lack well-defined optimization objectives, such as subjective quality assessment, existing methods rely on handcrafted numerical metrics or ground-truth labels and fail to capture nuanced user requirements. This paper proposes DEEVO (DEbate-driven EVOlutionary prompt optimization), an unsupervised prompt evolution framework that requires neither annotations nor predefined objective functions. Its core contributions are: (i) an Elo-based relative scoring mechanism grounded in structured multi-round debates among LLMs; (ii) semantics-preserving crossover and strategy-aware mutation operators for efficient search in the discrete prompt space; and (iii) an LLM-driven self-feedback loop enabling continuous iterative refinement. Experiments demonstrate that DEEVO significantly outperforms both human-crafted prompts and state-of-the-art automated prompt optimization methods on open- and closed-ended tasks, establishing a new paradigm for unsupervised prompt optimization.
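The Elo-based relative scoring described above can be illustrated with the standard Elo update applied to a single pairwise debate outcome. This is a minimal sketch: the K-factor, rating scale, and pairing schedule below are conventional defaults, not values taken from the paper.

```python
def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Standard Elo update for one pairwise comparison between two prompts.

    score_a is 1.0 if prompt A wins the debate judgment, 0.0 if it loses,
    and 0.5 for a draw. Returns the updated (rating_a, rating_b) pair.
    """
    # Expected score of A under the logistic Elo model (400-point scale).
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    expected_b = 1.0 - expected_a
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - expected_b)
    return new_a, new_b
```

Because only relative outcomes are needed, this update lets debate verdicts serve as a fitness signal without any ground-truth labels or task-specific metric.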

📝 Abstract
Prompt engineering represents a critical bottleneck to harnessing the full potential of Large Language Models (LLMs) for solving complex tasks, as it requires specialized expertise, significant trial-and-error, and manual intervention. This challenge is particularly pronounced for tasks involving subjective quality assessment, where defining explicit optimization objectives becomes fundamentally problematic. Existing automated prompt optimization methods falter in these scenarios, as they typically require well-defined task-specific numerical fitness functions or rely on generic templates that cannot capture the nuanced requirements of complex use cases. We introduce DEEVO (DEbate-driven EVOlutionary prompt optimization), a novel framework that guides prompt evolution through debate-driven evaluation with Elo-based selection. Contrary to prior work, DEEVO's approach enables exploration of the discrete prompt space while preserving semantic coherence through intelligent crossover and strategic mutation operations that incorporate debate-based feedback, combining elements from both successful and unsuccessful prompts based on identified strengths rather than arbitrary splicing. Using Elo ratings as a fitness proxy, DEEVO simultaneously drives improvement and preserves valuable diversity in the prompt population. Experimental results demonstrate that DEEVO significantly outperforms both manual prompt engineering and alternative state-of-the-art optimization approaches on open-ended and closed-ended tasks despite using no ground-truth feedback. By connecting LLMs' reasoning capabilities with adaptive optimization, DEEVO represents a significant advancement in prompt optimization research by eliminating the need for predetermined metrics to continuously improve AI systems.
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM prompts without predefined metrics
Addressing subjective quality assessment in prompt engineering
Automating prompt evolution via debate-driven feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Debate-driven evolutionary prompt optimization framework
Elo-based selection for prompt improvement
Semantic coherence via intelligent crossover and mutation
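The innovations above can be sketched as one generation of an Elo-guided evolutionary loop. The selection size, the elitism strategy, and the `llm_crossover`/`llm_mutate` callables are hypothetical placeholders standing in for LLM calls; the paper's exact operators are not reproduced here.

```python
import random


def evolve(population, ratings, llm_crossover, llm_mutate, top_k=2):
    """One generation of Elo-guided prompt evolution (illustrative sketch).

    population: list of prompt strings.
    ratings: dict mapping each prompt to its current Elo rating.
    llm_crossover(a, b): recombines two parent prompts (an LLM call in practice).
    llm_mutate(p): rewrites a prompt (an LLM call in practice).
    """
    # Select the highest-Elo prompts as parents.
    ranked = sorted(population, key=lambda p: ratings[p], reverse=True)
    parents = ranked[:top_k]

    # Fill the rest of the next generation with mutated crossovers of parents.
    children = []
    for _ in range(len(population) - top_k):
        a, b = random.sample(parents, 2)
        children.append(llm_mutate(llm_crossover(a, b)))

    # Elitism: carry the top-rated parents into the next generation unchanged.
    return parents + children
```

In the full framework, the `ratings` dict would be refreshed each generation by running structured debates between prompts and feeding the verdicts through the Elo update, so selection pressure comes entirely from relative comparisons rather than a fixed fitness function.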