🤖 AI Summary
Existing automated LLM evaluation benchmarks (e.g., MT-Bench, Arena-Hard) provide only aggregate scores, limiting their utility for model optimization and behavioral analysis. This paper advocates a paradigm shift from ranking-oriented assessment to actionable, fine-grained feedback generation. The proposed framework, Feedbacker, introduces three core components: (1) an extensible, tree-based query taxonomy builder; (2) an automated query synthesis scheme; and (3) a suite of interactive visualization and analysis tools. It is paired with a novel LLM-as-a-Judge method, PC² (Pre-Comparison-derived Criteria) pointwise evaluation, which derives evaluation criteria by pre-comparing auxiliary responses, attaining pairwise-level accuracy at pointwise time complexity. Evaluated across 17 mainstream LLMs, the approach precisely localizes model weaknesses and yields concrete, interpretable optimization guidance. It thus advances evaluation beyond "which model is better" toward "why a model excels or underperforms," enabling deeper, more actionable insights for iterative model development.
📝 Abstract
Automatic evaluation benchmarks such as MT-Bench, Arena-Hard, and Auto-Arena are seeing growing adoption for the evaluation of Large Language Models (LLMs). Existing research has primarily focused on approximating human-based model rankings using limited data and LLM-as-a-Judge. However, the fundamental premise of these studies, namely replicating human rankings, is flawed. Specifically, these benchmarks typically offer only overall scores, limiting their utility to leaderboard rankings rather than providing feedback that can guide model optimization and support model profiling. Therefore, we advocate for an evaluation paradigm shift from approximating human-based model rankings to providing feedback with analytical value. To this end, we introduce Feedbacker, an evaluation framework that provides comprehensive and fine-grained results, enabling thorough identification of a model's specific strengths and weaknesses. Such feedback not only supports targeted model optimization but also deepens the understanding of model behavior. Feedbacker comprises three key components: an extensible tree-based query taxonomy builder, an automated query synthesis scheme, and a suite of visualization and analysis tools. Furthermore, we propose a novel LLM-as-a-Judge method: PC² (Pre-Comparison-derived Criteria) pointwise evaluation. This method derives evaluation criteria by pre-comparing the differences among several auxiliary responses, achieving the accuracy of pairwise evaluation while maintaining the time complexity of pointwise evaluation. Finally, leveraging the evaluation results of 17 mainstream LLMs, we demonstrate the use of Feedbacker and highlight its effectiveness and potential. Our project homepage is available at https://liudan193.github.io/Feedbacker.
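The PC² idea described above can be sketched in a few lines. Everything in this sketch is an illustrative assumption rather than the paper's implementation: the function names are hypothetical, and the string heuristics stand in for the LLM judge that would articulate differences between auxiliary responses. The point is the control flow: the O(n²) comparison happens once over a small set of auxiliary responses to produce criteria, after which each candidate is scored independently in O(n) judge calls.

```python
# Toy sketch of PC^2 (Pre-Comparison-derived Criteria) pointwise evaluation.
# The heuristics below are stand-ins for an LLM judge; only the two-phase
# structure (pre-compare once, then score pointwise) mirrors the method.

def derive_criteria(auxiliary_responses):
    """Phase 1: pre-compare auxiliary responses pairwise and turn the
    observed differences into a reusable list of evaluation criteria."""
    criteria = set()
    for i in range(len(auxiliary_responses)):
        for j in range(i + 1, len(auxiliary_responses)):
            a, b = auxiliary_responses[i], auxiliary_responses[j]
            # An LLM judge would explain *why* one response is better;
            # here, crude proxies for those explanations:
            if len(a.split()) != len(b.split()):
                criteria.add("completeness")
            if ("```" in a) != ("```" in b):
                criteria.add("includes worked example")
    return sorted(criteria) or ["overall quality"]

def pc2_pointwise_score(response, criteria):
    """Phase 2: score one response against the pre-derived criteria.
    Each candidate is judged independently, yet the criteria still
    encode the contrasts discovered during pre-comparison."""
    hits = 0
    if len(response.split()) >= 20:
        hits += "completeness" in criteria
    if "```" in response:
        hits += "includes worked example" in criteria
    return hits / len(criteria)

aux = [
    "Short answer.",
    "A longer answer that explains the steps in more detail with ```code```.",
]
criteria = derive_criteria(aux)
scores = [pc2_pointwise_score(r, criteria) for r in aux]
```

Here the pre-comparison surfaces two criteria ("completeness" and "includes worked example"), and each response is then scored against them without any further pairwise calls, which is where the pointwise time complexity comes from.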