From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback

📅 2025-05-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing automated LLM evaluation benchmarks (e.g., MT-Bench, Arena-Hard) provide only aggregate scores, limiting their utility for model optimization and behavioral analysis. This paper advocates a paradigm shift from ranking-oriented assessment to actionable, fine-grained feedback. The proposed framework, Feedbacker, rests on three core components: (1) an extensible, tree-based query taxonomy coupled with an automated query synthesis scheme; (2) a suite of visualization and analysis tools integrated with LLM-as-a-Judge evaluation; and (3) PC² (Pre-Comparison-derived Criteria) pointwise evaluation, which derives evaluation criteria by pre-comparing auxiliary responses, matching the accuracy of pairwise evaluation while retaining the time complexity of pointwise evaluation. Applied to 17 mainstream LLMs, the approach improves the precision of weakness localization and yields concrete, interpretable optimization guidance. It thus advances evaluation objectives beyond "which model is better" toward "why a model excels or underperforms," enabling deeper, more actionable insights for iterative model development.

📝 Abstract
Automatic evaluation benchmarks such as MT-Bench, Arena-Hard, and Auto-Arena are seeing growing adoption for the evaluation of Large Language Models (LLMs). Existing research has primarily focused on approximating human-based model rankings using limited data and LLM-as-a-Judge. However, the fundamental premise of these studies, replicating human rankings, is flawed. Specifically, these benchmarks typically offer only overall scores, limiting their utility to leaderboard rankings, rather than providing feedback that can guide model optimization and support model profiling. Therefore, we advocate for an evaluation paradigm shift from approximating human-based model rankings to providing feedback with analytical value. To this end, we introduce Feedbacker, an evaluation framework that provides comprehensive and fine-grained results, thereby enabling thorough identification of a model's specific strengths and weaknesses. Such feedback not only supports the targeted optimization of the model but also enhances the understanding of its behavior. Feedbacker comprises three key components: an extensible tree-based query taxonomy builder, an automated query synthesis scheme, and a suite of visualization and analysis tools. Furthermore, we propose a novel LLM-as-a-Judge method: PC2 (Pre-Comparison-derived Criteria) pointwise evaluation. This method derives evaluation criteria by pre-comparing the differences between several auxiliary responses, achieving the accuracy of pairwise evaluation while maintaining the time complexity of pointwise evaluation. Finally, leveraging the evaluation results of 17 mainstream LLMs, we demonstrate the usage of Feedbacker and highlight its effectiveness and potential. Our project homepage is available at https://liudan193.github.io/Feedbacker.
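The abstract describes an extensible tree-based query taxonomy builder but does not specify its data layout. A minimal sketch of one plausible structure, where `TaxonomyNode` and `add_path` are hypothetical names chosen here for illustration, might look like this: categories form a tree, new branches can be grafted on at any time, and queries attach to leaves.

```python
from dataclasses import dataclass, field

@dataclass
class TaxonomyNode:
    """One node in a tree-based query taxonomy (a hypothetical layout;
    the paper's actual builder is not specified in this summary)."""
    name: str
    children: dict = field(default_factory=dict)
    queries: list = field(default_factory=list)

    def add_path(self, path):
        # Walk the category path, creating missing nodes along the way,
        # so the taxonomy can be extended without rebuilding the tree.
        node = self
        for part in path:
            node = node.children.setdefault(part, TaxonomyNode(part))
        return node

root = TaxonomyNode("queries")
leaf = root.add_path(["writing", "creative", "poetry"])
leaf.queries.append("Write a haiku about autumn.")
```

Because `add_path` is idempotent over existing branches, an automated query synthesis scheme could repeatedly insert generated queries under fine-grained categories and the tree would grow only where new categories appear.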
Problem

Research questions and friction points this paper is trying to address.

Shifting evaluation focus from leaderboard rankings to actionable feedback
Providing comprehensive analysis for identifying model strengths and weaknesses
Introducing Feedbacker framework for fine-grained LLM evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feedbacker framework provides fine-grained model evaluation
Tree-based query taxonomy and automated query synthesis
PC2 pointwise evaluation matches pairwise accuracy at pointwise time complexity
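The efficiency claim behind PC² can be sketched as control flow: pairwise comparison is run only over a small, fixed set of auxiliary responses to derive criteria, after which each of the n candidate responses is scored once pointwise, giving O(n) judge calls instead of O(n²). The sketch below is an assumption about the interface, with `pre_compare` and `score_pointwise` standing in for LLM judge calls the paper does not detail here; the toy functions exist only to make the flow runnable.

```python
from itertools import combinations

def derive_criteria(aux_responses, pre_compare):
    """Derive evaluation criteria by pre-comparing pairs of auxiliary
    responses. Pairwise cost is paid only over this small fixed set,
    so it does not grow with the number of models evaluated."""
    criteria = set()
    for a, b in combinations(aux_responses, 2):
        criteria.update(pre_compare(a, b))
    return sorted(criteria)

def pc2_evaluate(responses, aux_responses, pre_compare, score_pointwise):
    """Score each candidate once against the pre-derived criteria:
    O(n) judge calls for n responses, versus O(n^2) for full pairwise."""
    criteria = derive_criteria(aux_responses, pre_compare)
    return {name: score_pointwise(resp, criteria)
            for name, resp in responses.items()}

# Toy stand-ins for the LLM judge, illustrating control flow only.
def toy_pre_compare(a, b):
    return {"detail"} if len(a) != len(b) else set()

def toy_score(resp, criteria):
    return len(resp) if "detail" in criteria else 1

scores = pc2_evaluate(
    responses={"model_a": "short", "model_b": "a longer answer"},
    aux_responses=["hi", "a fuller reply"],
    pre_compare=toy_pre_compare,
    score_pointwise=toy_score,
)
```

The design point is that the expensive comparative judgment is amortized into a reusable rubric, so adding an 18th model costs one more pointwise call rather than 17 new pairwise ones.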