SkillVerse: Assessing and Enhancing LLMs with Tree Evaluation

📅 2025-05-31
🤖 AI Summary
Problem: Existing evaluation methods for large language models (LLMs) are coarse-grained and offer little skill-level interpretability. Method: This paper proposes SkillVerse, an unsupervised, tree-structured diagnostic framework. It uses an LLM as a judge to critique model responses, then organizes the critiques into a capability dendrogram via hierarchical clustering, so that proficiency can be read off at any level of granularity. The dendrogram also drives two downstream uses: a tree-search algorithm that selects more informative few-shot demonstrations, and prediction of a model's weaknesses. The framework requires no human annotation. Results: In-context learning improves by 25% with tree-search demonstration selection, and the framework predicts new model weaknesses with a 55% success rate, 22% higher than without SkillVerse.
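To make the idea of organizing critiques into a dendrogram concrete, here is a minimal sketch of agglomerative (bottom-up) hierarchical clustering over a handful of toy critiques. Everything in it is an assumption made for illustration: the critiques are invented, the word-overlap (Jaccard) distance and average linkage are stand-ins, and the names `jaccard_distance`, `leaves`, and `build_dendrogram` are hypothetical. The summary above does not say which features or linkage SkillVerse actually uses.

```python
from itertools import combinations


def jaccard_distance(a: set, b: set) -> float:
    """1 - |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def leaves(node):
    """Leaf indices under a node, where a node is an int or a (left, right) pair."""
    if isinstance(node, int):
        return [node]
    return leaves(node[0]) + leaves(node[1])


def build_dendrogram(critiques):
    """Average-linkage agglomerative clustering; returns a nested-tuple tree."""
    tokens = [set(c.lower().split()) for c in critiques]
    clusters = list(range(len(critiques)))  # every critique starts as its own leaf

    def linkage(x, y):
        # Average pairwise distance between the leaves of two clusters.
        pairs = [(i, j) for i in leaves(x) for j in leaves(y)]
        return sum(jaccard_distance(tokens[i], tokens[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the positions of the two closest clusters and merge them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


# Toy critiques (invented for illustration, not taken from the paper).
critiques = [
    "off by one error in loop bounds",
    "loop bounds error causes index overflow",
    "tone too formal for casual email",
    "email tone too casual for formal request",
]
tree = build_dendrogram(critiques)
print(tree)
```

In this toy run the two coding critiques end up under one branch and the two writing critiques under the other. A real pipeline would more plausibly cluster embeddings of the judge's critiques than raw word sets.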

📝 Abstract
As language models evolve to tackle complex, multifaceted tasks, their evaluation must adapt to capture this intricacy. A granular, skill-specific understanding of model capabilities can empower researchers to make informed model development plans. In this paper, we introduce SkillVerse, an unsupervised tree-structured diagnosis framework for understanding model proficiency in specific abilities. With an LLM as a judge, SkillVerse first critiques the model responses, and then organizes them into a hierarchical structure termed a dendrogram. By reporting proficiency at arbitrary levels of granularity, SkillVerse is flexible enough to produce insights into the behaviors of modern large models. We also demonstrate its efficacy in two downstream tasks: 1) improving model in-context learning by 25% using a tree-search algorithm to select more informative few-shot demonstrations, and 2) accurately predicting new model weaknesses with a 55% success rate, 22% higher than without SkillVerse.
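The phrase "proficiency at arbitrary levels of granularity" can be illustrated with a small sketch: if each leaf of the tree holds judged outcomes, the accuracy of any internal node is the pooled accuracy of the leaves beneath it, and a depth limit controls how fine-grained the report is. The `SkillNode` class, the skill names, and the outcome data below are all hypothetical, and pooled accuracy is an assumed proficiency measure rather than the one defined in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class SkillNode:
    name: str
    children: list = field(default_factory=list)
    outcomes: list = field(default_factory=list)  # leaves only: True = judged correct


def all_outcomes(node):
    """Pool the judged outcomes of every leaf under this node."""
    if not node.children:
        return list(node.outcomes)
    return [o for child in node.children for o in all_outcomes(child)]


def proficiency(node):
    """Fraction of pooled outcomes judged correct, or None if there are none."""
    results = all_outcomes(node)
    return sum(results) / len(results) if results else None


def report(node, max_depth, depth=0):
    """Return (name, accuracy, n) rows for every node down to max_depth."""
    rows = [(node.name, proficiency(node), len(all_outcomes(node)))]
    if depth < max_depth:
        for child in node.children:
            rows += report(child, max_depth, depth + 1)
    return rows


# Toy tree with invented skills and outcomes.
root = SkillNode("reasoning", children=[
    SkillNode("arithmetic", children=[
        SkillNode("addition", outcomes=[True, True, True, False]),
        SkillNode("division", outcomes=[False, False, True, False]),
    ]),
    SkillNode("logic", outcomes=[True, True]),
])

for name, acc, n in report(root, max_depth=2):
    print(f"{name:12s} accuracy={acc:.2f} n={n}")
```

Calling `report(root, max_depth=1)` gives only the coarse view (reasoning, arithmetic, logic), while `max_depth=2` exposes that the arithmetic weakness is concentrated in division.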
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM performance on complex tasks at a granular, skill-specific level
Diagnosing model proficiency with an unsupervised, tree-structured framework
Improving in-context learning and accurately predicting model weaknesses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised tree-structured diagnosis framework
LLM as judge for hierarchical response organization
Tree-search algorithm for improved in-context learning
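As a rough illustration of how a tree search could pick few-shot demonstrations, the sketch below greedily descends a skill tree toward the lowest-accuracy branch and returns examples from that subtree. This is a simplified stand-in, not the paper's algorithm: the greedy descent rule, the `min_support` threshold, the preference for examples the model failed on, the dictionary layout, and all function names are assumptions made for this example.

```python
def examples_under(node):
    """All (question, reference_answer, model_was_correct) triples in a subtree."""
    found = list(node.get("examples", []))
    for child in node.get("children", []):
        found += examples_under(child)
    return found


def accuracy(node):
    """Fraction of examples in the subtree that the model answered correctly."""
    ex = examples_under(node)
    return sum(correct for _, _, correct in ex) / len(ex)


def weakest_subtree(node, min_support=2):
    """Greedily descend to the lowest-accuracy child with enough examples.

    min_support must be at least 1 so that accuracy() never divides by zero.
    """
    path = [node["name"]]
    while True:
        candidates = [
            c for c in node.get("children", [])
            if len(examples_under(c)) >= min_support
        ]
        if not candidates:
            return node, path
        node = min(candidates, key=accuracy)
        path.append(node["name"])


def select_demonstrations(root, k=2, min_support=2):
    """Return the search path and up to k (question, answer) demonstrations."""
    node, path = weakest_subtree(root, min_support)
    # Stable sort on the correctness flag: failed examples come first.
    ranked = sorted(examples_under(node), key=lambda e: e[2])
    return path, [(q, a) for q, a, _ in ranked[:k]]


# Toy skill tree with invented questions and outcomes.
root = {"name": "math", "children": [
    {"name": "arithmetic", "children": [
        {"name": "addition", "examples": [("2+2", "4", True), ("7+5", "12", True)]},
        {"name": "division", "examples": [
            ("9/3", "3", False), ("8/2", "4", True), ("7/2", "3.5", False)]},
    ]},
    {"name": "geometry", "examples": [
        ("angles in a triangle", "180", True), ("sides of a square", "4", True)]},
]}

path, demos = select_demonstrations(root, k=2)
print(" > ".join(path))
print(demos)
```

The returned path doubles as an explanation of why those demonstrations were chosen, which is the kind of interpretability a tree-structured diagnosis makes possible.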
Yufei Tian
University of California, Los Angeles
Jiao Sun
Google DeepMind
Natural Language Generation
Nanyun Peng
University of California, Los Angeles
Zizhao Zhang
Google Cloud AI