🤖 AI Summary
Evaluation-oriented large language models (LLMs) suffer from preference non-transitivity (e.g., cyclic preferences A≻B≻C≻A) and low overall preference clarity, primarily due to low-quality training data. Method: We propose ELSPR, a data self-purification framework based on tournament graph reconstruction. It is the first to formalize non-transitivity as structural anomalies in directed tournament graphs; it introduces directed graph structural entropy to quantify preference consistency and designs an end-to-end self-purification strategy that provably improves transitivity. Results: Evaluated on Qwen2.5-Max, ELSPR reduces non-transitivity by 13.78%, lowers structural entropy by 0.0879, raises agreement with human judgments by 0.6%, and improves Spearman correlation by 0.01. This work provides both theoretical foundations and practical methodologies for building reliable, interpretable LLM evaluators.
📝 Abstract
Large language models (LLMs) are widely used as evaluators for open-ended tasks. While previous research has emphasized biases in LLM evaluations, the issue of non-transitivity in pairwise comparisons remains unresolved: evaluators may prefer A over B and B over C, yet prefer C over A. Our results suggest that low-quality training data reduces the transitivity of preferences generated by the evaluator LLM. To address this, we propose a graph-theoretic framework that analyzes and mitigates the problem by modeling pairwise preferences as tournament graphs. We quantify non-transitivity and introduce directed graph structural entropy to measure the overall clarity of preferences. Our analysis reveals substantial non-transitivity in advanced evaluator LLMs (67.96% for Qwen2.5-Max), along with high entropy values (0.8095 for Qwen2.5-Max), reflecting low overall preference clarity. We therefore design a filtering strategy, ELSPR, that eliminates preference data inducing non-transitivity and retains only consistent, transitive preference data for model fine-tuning. Experiments demonstrate that models fine-tuned on the filtered data reduce non-transitivity by 13.78% (from 64.28% to 50.50%), decrease structural entropy by 0.0879 (from 0.8113 to 0.7234), and align more closely with human evaluators (human agreement rate improves by 0.6% and Spearman correlation increases by 0.01).
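To make the tournament-graph view concrete, here is a minimal sketch of how non-transitivity can be quantified: represent each pairwise judgment as a directed edge (winner, loser) in a complete tournament over the items, then count the fraction of item triads that form a 3-cycle (A beats B, B beats C, C beats A). This is an illustrative example of the general idea, not the paper's implementation; the function name and edge representation are assumptions.

```python
from itertools import combinations

def non_transitivity_rate(edges):
    """Fraction of item triads forming a preference cycle.

    `edges` is a set of (winner, loser) pairs describing a complete
    tournament: exactly one of (x, y) / (y, x) is present per item pair.
    This is an illustrative metric, not the paper's exact definition.
    """
    items = sorted({x for edge in edges for x in edge})
    cyclic = total = 0
    for a, b, c in combinations(items, 3):
        total += 1
        # Orientation of each pair relative to the sorted order (a, b, c).
        ab = (a, b) in edges
        bc = (b, c) in edges
        ca = (c, a) in edges
        # A triad is cyclic iff all three edges point the same way around
        # the triangle (a->b->c->a, or the reverse b->a, c->b, a->c).
        if ab == bc == ca:
            cyclic += 1
    return cyclic / total if total else 0.0
```

For example, the cyclic judgments `{("A","B"), ("B","C"), ("C","A")}` yield a rate of 1.0, while the transitive set `{("A","B"), ("B","C"), ("A","C")}` yields 0.0. A filtering strategy in the spirit of ELSPR would discard preference pairs whose removal breaks such cycles, keeping only transitive data for fine-tuning.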