🤖 AI Summary
This work identifies a baseline-dependent ranking bias in LLM-based automatic evaluation, caused by violations of the transitivity assumption that pairwise comparison against a single baseline relies on. Within the AlpacaEval framework, it shows empirically that LLM judges exhibit non-transitive preferences, so model rankings shift with the choice of baseline. As a remedy, the authors show that round-robin tournaments combined with Bradley-Terry preference models yield more reliable rankings: the Spearman and Kendall correlations with Chatbot Arena rise from 95.0% to 96.4% and from 82.1% to 86.3%, respectively. Because round-robin tournaments are computationally expensive, the authors also propose Swiss-Wise Iterative Matchmaking (Swim), a tournament format whose dynamic matching strategy aims to retain the benefits of round-robin comparison at lower computational cost.
📝 Abstract
Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (from 95.0% to 96.4% and from 82.1% to 86.3%, respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, using a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.
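To make the round-robin plus Bradley-Terry idea concrete, here is a minimal sketch, not the authors' implementation. It fits Bradley-Terry strengths to a matrix of pairwise win counts using the standard minorization-maximization (MM) update. The function name `fit_bradley_terry`, the three hypothetical models A, B, and C, and the win counts are all illustrative assumptions; the counts are chosen to form a non-transitive cycle (A beats B, B beats C, C beats A).

```python
import numpy as np

def fit_bradley_terry(wins, iters=1000, tol=1e-10):
    """Fit Bradley-Terry strengths with the classic MM update.

    wins[i, j] is the number of comparisons in which model i was
    preferred over model j (diagonal entries must be zero).
    Assumes every model has at least one win and one loss, so the
    maximum-likelihood estimate exists.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    games = wins + wins.T            # comparisons played per pair
    total_wins = wins.sum(axis=1)    # total wins per model
    p = np.full(n, 1.0 / n)          # start from equal strengths
    for _ in range(iters):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        new_p = total_wins / denom.sum(axis=1)
        new_p /= new_p.sum()         # fix the scale: strengths sum to 1
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return p

# Hypothetical judge outcomes over 100 prompts per pair (illustrative only).
names = ["A", "B", "C"]
wins = np.array([
    [0, 60, 45],   # A beats B 60-40, loses to C 45-55
    [40, 0, 60],   # B beats C 60-40
    [55, 40, 0],   # C beats A 55-45
])

strengths = fit_bradley_terry(wins)
ranking = [names[i] for i in np.argsort(-strengths)]
print(dict(zip(names, strengths.round(3))), ranking)
```

With these made-up counts, ranking against a single baseline is inconsistent: against baseline A, C (55% win rate) ranks above B (40%), while against baseline C, B (60%) ranks above A (45%). The Bradley-Terry fit uses all pairs at once and returns a single ordering, which in a balanced round-robin agrees with total wins (A 105, B 100, C 95).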