🤖 AI Summary
Existing automated LLM evaluation benchmarks (e.g., MT-Bench, Arena-Hard) provide only aggregate scores, limiting their utility for model optimization and behavioral analysis. This paper advocates a paradigm shift from ranking-oriented assessment to actionable, fine-grained feedback generation. The proposed framework, Feedbacker, introduces three core components: (1) an extensible, tree-based query taxonomy builder; (2) an automated query synthesis scheme; and (3) a suite of interactive visualization and analysis tools. It is paired with a novel LLM-as-a-Judge method, PC² (Pre-Comparison-derived Criteria) pointwise evaluation, which derives evaluation criteria by pre-comparing auxiliary responses, attaining pairwise-level accuracy at pointwise time complexity. Evaluated across 17 mainstream LLMs, the approach precisely localizes model weaknesses and yields concrete, interpretable optimization guidance. It thus advances evaluation beyond "which model is better" toward "why a model excels or underperforms," enabling deeper, more actionable insights for iterative model development.
📝 Abstract
Automatic evaluation benchmarks such as MT-Bench, Arena-Hard, and Auto-Arena are seeing growing adoption for the evaluation of Large Language Models (LLMs). Existing research has primarily focused on approximating human-based model rankings using limited data and LLM-as-a-Judge. However, the fundamental premise of these studies, namely replicating human rankings, is flawed. Specifically, these benchmarks typically offer only overall scores, limiting their utility to leaderboard rankings rather than providing feedback that can guide model optimization and support model profiling. Therefore, we advocate for an evaluation paradigm shift from approximating human-based model rankings to providing feedback with analytical value. To this end, we introduce Feedbacker, an evaluation framework that provides comprehensive and fine-grained results, enabling thorough identification of a model's specific strengths and weaknesses. Such feedback not only supports targeted model optimization but also deepens the understanding of model behavior. Feedbacker comprises three key components: an extensible tree-based query taxonomy builder, an automated query synthesis scheme, and a suite of visualization and analysis tools. Furthermore, we propose a novel LLM-as-a-Judge method: PC² (Pre-Comparison-derived Criteria) pointwise evaluation. This method derives evaluation criteria by pre-comparing the differences among several auxiliary responses, achieving the accuracy of pairwise evaluation while maintaining the time complexity of pointwise evaluation. Finally, leveraging the evaluation results of 17 mainstream LLMs, we demonstrate the use of Feedbacker and highlight its effectiveness and potential. Our project homepage is available at https://liudan193.github.io/Feedbacker.
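The PC² idea described above can be sketched in a few lines. Everything in this sketch is an illustrative assumption rather than the paper's implementation: the function names are hypothetical, and the string heuristics stand in for the LLM judge that would articulate differences between auxiliary responses. The point is the control flow: the O(n²) comparison happens once over a small set of auxiliary responses to produce criteria, after which each candidate is scored independently in O(n) judge calls.

```python
# Toy sketch of PC^2 (Pre-Comparison-derived Criteria) pointwise evaluation.
# The heuristics below are stand-ins for an LLM judge; only the two-phase
# structure (pre-compare once, then score pointwise) mirrors the method.

def derive_criteria(auxiliary_responses):
    """Phase 1: pre-compare auxiliary responses pairwise and turn the
    observed differences into a reusable list of evaluation criteria."""
    criteria = set()
    for i in range(len(auxiliary_responses)):
        for j in range(i + 1, len(auxiliary_responses)):
            a, b = auxiliary_responses[i], auxiliary_responses[j]
            # An LLM judge would explain *why* one response is better;
            # here, crude proxies for those explanations:
            if len(a.split()) != len(b.split()):
                criteria.add("completeness")
            if ("```" in a) != ("```" in b):
                criteria.add("includes worked example")
    return sorted(criteria) or ["overall quality"]

def pc2_pointwise_score(response, criteria):
    """Phase 2: score one response against the pre-derived criteria.
    Each candidate is judged independently, yet the criteria still
    encode the contrasts discovered during pre-comparison."""
    hits = 0
    if len(response.split()) >= 20:
        hits += "completeness" in criteria
    if "```" in response:
        hits += "includes worked example" in criteria
    return hits / len(criteria)

aux = [
    "Short answer.",
    "A longer answer that explains the steps in more detail with ```code```.",
]
criteria = derive_criteria(aux)
scores = [pc2_pointwise_score(r, criteria) for r in aux]
```

Here the pre-comparison surfaces two criteria ("completeness" and "includes worked example"), and each response is then scored against them without any further pairwise calls, which is where the pointwise time complexity comes from.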