HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning

📅 2025-03-02
🤖 AI Summary
Existing LLM evaluations predominantly focus on horizontal or flat-structure reasoning, overlooking the critical capability of hierarchical reasoning. Method: we propose HiBench, the first systematic benchmark for hierarchical reasoning: (1) we formally define and quantify five core dimensions of hierarchical reasoning; (2) we design 30 multi-difficulty tasks spanning generation, comprehension, and modification, covering 39,519 queries across six realistic scenarios, including implicit hierarchy identification and dynamic structural editing; (3) we introduce structured prompting, multi-granularity difficulty control, and a five-dimensional evaluation framework, and release an open-source toolchain and a high-quality instruction dataset. Contribution/Results: evaluating 20 mainstream models, we find that explicit hierarchical reasoning is moderately robust, while implicit and dynamic hierarchical modeling remains severely deficient. Lightweight instruction tuning yields average improvements of 88.84% (Llama-3.1-8B) and 31.38% (Qwen2.5-7B), demonstrating HiBench's efficacy in diagnosing and enhancing hierarchical reasoning capabilities.

📝 Abstract
Structure reasoning is a fundamental capability of large language models (LLMs), enabling them to reason about structured commonsense and answer multi-hop questions. However, existing benchmarks for structure reasoning mainly focus on horizontal and coordinate structures (e.g., graphs), overlooking the hierarchical relationships within them. Hierarchical structure reasoning is crucial for human cognition, particularly in memory organization and problem-solving. It also plays a key role in various real-world tasks, such as information extraction and decision-making. To address this gap, we propose HiBench, the first framework spanning from initial structure generation to final proficiency assessment, designed to benchmark the hierarchical reasoning capabilities of LLMs systematically. HiBench encompasses six representative scenarios, covering both fundamental and practical aspects, and consists of 30 tasks with varying hierarchical complexity, totaling 39,519 queries. To evaluate LLMs comprehensively, we develop five capability dimensions that depict different facets of hierarchical structure understanding. Through extensive evaluation of 20 LLMs from 10 model families, we reveal key insights into their capabilities and limitations: 1) existing LLMs show proficiency in basic hierarchical reasoning tasks; 2) they still struggle with more complex structures and implicit hierarchical representations, especially in structural modification and textual reasoning. Based on these findings, we create a small yet well-designed instruction dataset, which enhances LLMs' performance on HiBench by an average of 88.84% (Llama-3.1-8B) and 31.38% (Qwen2.5-7B) across all tasks. The HiBench dataset and toolkit are available at https://github.com/jzzzzh/HiBench to encourage further evaluation.
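To make the notion of a hierarchical reasoning query concrete, here is a minimal Python sketch of the kind of task the abstract describes (asking a model about ancestor relations in a tree). The toy hierarchy, question wording, and checker below are illustrative assumptions, not HiBench's actual task schema or data.

```python
# Hypothetical illustration of a hierarchical-reasoning query:
# given a small parent-child hierarchy, ask for the ancestor chain
# of a node and verify a model's answer against the ground truth.
# This mirrors the *kind* of task HiBench poses, not its real format.

tree = {
    "company": ["engineering", "sales"],
    "engineering": ["frontend", "backend"],
    "backend": ["databases"],
}

def ancestors(tree, node):
    """Return the ancestors of `node`, nearest parent first."""
    parent = {c: p for p, kids in tree.items() for c in kids}
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

# A query like those a benchmark might render as natural language:
prompt = (
    "Given the hierarchy company > engineering > backend > databases, "
    "list all ancestors of 'databases', nearest first."
)
expected = ancestors(tree, "databases")
print(expected)  # ['backend', 'engineering', 'company']
```

A full benchmark would additionally vary depth and branching factor to control difficulty, which is what the abstract's "varying hierarchical complexity" refers to.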
Problem

Research questions and friction points this paper is trying to address.

Benchmarking hierarchical reasoning in large language models (LLMs).
Addressing gaps in hierarchical structure reasoning benchmarks.
Enhancing LLMs' performance on complex hierarchical reasoning tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

HiBench benchmarks hierarchical reasoning in LLMs
Includes 30 tasks with 39,519 queries
Enhances LLMs' performance with a small instruction dataset
Authors

Zhuohang Jiang (The Hong Kong Polytechnic University): LLM, RAG, RecSys
Pangjing Wu (The Hong Kong Polytechnic University): Reinforcement Learning, Natural Language Processing, Data Mining
Ziran Liang (The Hong Kong Polytechnic University): Large Language Models (LLMs), Embedding Representation Learning, Time Series Forecasting
Peter Q. Chen (The Hong Kong Polytechnic University)
Xu Yuan (The Hong Kong Polytechnic University, Hong Kong)
Ye Jia (The Hong Kong Polytechnic University, Hong Kong)
Jiancheng Tu (The Hong Kong Polytechnic University): Interpretable Machine Learning, Credit Scoring, AHP
Chen Li (The Hong Kong Polytechnic University, Hong Kong)
P. H. Ng (The Hong Kong Polytechnic University, Hong Kong)
Qing Li (The Hong Kong Polytechnic University, Hong Kong)