🤖 AI Summary
This work addresses the lack of interpretability and actionability in identifying weaknesses of large language models (LLMs). We propose a capability-tree-driven paradigm for generating interpretable weakness profiles: a hierarchical, semantically grounded capability tree is constructed, and benchmark instances are mapped to capability nodes described in natural language, enabling fine-grained, interpretable weakness localization. The framework integrates hierarchical clustering, instance–capability alignment, and quantitative evaluation metrics to support weakness-guided data collection and bias diagnosis in evaluation protocols. An interactive interface for exploring capability trees is also provided. Evaluated on MATH and WildChat, our method significantly improves weakness identification accuracy (+12.3%) and coverage (+18.7%). Weakness-informed data collection yields greater performance gains than mainstream baselines. Moreover, we systematically uncover, for the first time, capability-dimensional biases embedded in Chatbot Arena's human voting process.
📝 Abstract
An ideal model evaluation should achieve two goals: identifying where the model fails and providing actionable improvement guidance. Toward these goals for Language Model (LM) evaluations, we formulate the problem of generating a weakness profile, a set of weaknesses expressed in natural language, given an LM's performance on every individual instance in a benchmark. We introduce a suite of quantitative assessments to compare different weakness profiling methods. We also propose EvalTree, a weakness profiling method. It constructs a capability tree where each node represents a capability described in natural language and is linked to a subset of benchmark instances that specifically evaluate this capability; it then extracts nodes where the LM performs poorly to generate a weakness profile. On the MATH and WildChat benchmarks, we show that EvalTree outperforms baseline weakness profiling methods by identifying weaknesses more precisely and comprehensively. Weakness profiling further enables weakness-guided data collection, and training data collection guided by EvalTree-identified weaknesses improves LM performance more than other data collection strategies. We also show how EvalTree exposes flaws in Chatbot Arena's human-voter-based evaluation practice. To facilitate future work, we release our code and an interface that allows practitioners to interactively explore the capability trees built by EvalTree.