🤖 AI Summary
This work addresses the lack of interpretability and actionability in identifying weaknesses of large language models (LLMs). We propose a capability-tree-driven paradigm for generating interpretable weakness profiles: a hierarchical, semantically grounded capability tree is constructed, and benchmark instances are mapped to capability nodes described in natural language, enabling fine-grained, interpretable weakness localization. The framework integrates hierarchical clustering, instance–capability alignment, and quantitative evaluation metrics to support weakness-guided data collection and bias diagnosis in evaluation protocols. An interactive interface for exploring capability trees is also provided. Evaluated on MATH and WildChat, our method significantly improves weakness identification accuracy (+12.3%) and coverage (+18.7%). Weakness-informed data collection yields greater performance gains than mainstream baselines. Moreover, we systematically uncover, for the first time, capability-dimensional biases embedded in Chatbot Arena's human voting process.
📝 Abstract
An ideal model evaluation should achieve two goals: identifying where the model fails and providing actionable improvement guidance. Toward these goals for Language Model (LM) evaluations, we formulate the problem of generating a weakness profile, a set of weaknesses expressed in natural language, given an LM's performance on every individual instance in a benchmark. We introduce a suite of quantitative assessments to compare different weakness profiling methods. We also propose EvalTree, a weakness profiling method. It constructs a capability tree where each node represents a capability described in natural language and is linked to a subset of benchmark instances that specifically evaluate this capability; it then extracts nodes where the LM performs poorly to generate a weakness profile. On the MATH and WildChat benchmarks, we show that EvalTree outperforms baseline weakness profiling methods by identifying weaknesses more precisely and comprehensively. Weakness profiling further enables weakness-guided data collection, and training data collection guided by EvalTree-identified weaknesses improves LM performance more than other data collection strategies. We also show how EvalTree exposes flaws in Chatbot Arena's human-voter-based evaluation practice. To facilitate future work, we release our code and an interface that allows practitioners to interactively explore the capability trees built by EvalTree.