DSR-Bench: Evaluating the Structural Reasoning Abilities of LLMs via Data Structures

📅 2025-05-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks lack fine-grained evaluation of large language models' (LLMs) structural reasoning at the data-structure level. To address this, we propose DSR-Bench, the first automated, data-structure-centric benchmark, comprising 20 data structures, 35 operation types, and 4,140 synthetically generated questions. It establishes a hierarchical, fully automated, and subjectivity-free evaluation paradigm grounded in data structures. Combining structured prompt engineering, deterministic programmatic assessment, and multidimensional capability decomposition, we evaluate nine state-of-the-art models. Our analysis uncovers fundamental limitations in multi-attribute, multi-hop, and hybrid-structure reasoning: instruction-tuned models show weak foundational structural reasoning; even reasoning-oriented models reach at most a 47% average score on the challenge subset; and performance degrades significantly on tasks involving multidimensional data and natural-language task descriptions.
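The paper does not spell out its generation pipeline here, but the combination of synthetic questions and deterministic programmatic assessment can be sketched. The function below (names and question wording are hypothetical, not the paper's) builds a random push/pop sequence on a stack and computes the ground-truth answer by simulation, so the answer key requires no human or model judge:

```python
import random

def make_stack_question(n_ops=6, seed=0):
    """Generate a random push/pop sequence on a stack and compute the
    ground-truth final state programmatically, so grading needs no
    human or model judge."""
    rng = random.Random(seed)          # fixed seed -> reproducible item
    ops, stack = [], []
    for _ in range(n_ops):
        if stack and rng.random() < 0.4:
            ops.append("pop")
            stack.pop()
        else:
            value = rng.randint(0, 9)
            ops.append(f"push {value}")
            stack.append(value)
    prompt = ("Apply these operations to an empty stack: "
              + "; ".join(ops)
              + ". List the final stack from bottom to top.")
    return prompt, ops, stack          # stack is the deterministic answer key

prompt, ops, answer = make_stack_question(seed=1)
print(prompt)
print("answer key:", answer)
```

Because the item is generated from a seed and graded against a simulated structure, the same recipe scales to arbitrarily many questions while keeping contamination risk low.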

📝 Abstract
Large language models (LLMs) are increasingly deployed for real-world tasks that fundamentally involve data manipulation. A core requirement across these tasks is the ability to perform structural reasoning--that is, to understand and reason about data relationships. For example, customer requests require a temporal ordering, which can be represented by data structures such as queues. However, existing benchmarks primarily focus on high-level, application-driven evaluations without isolating this fundamental capability. To address this gap, we introduce DSR-Bench, a novel benchmark evaluating LLMs' structural reasoning capabilities through data structures, which provide interpretable representations of data relationships. DSR-Bench includes 20 data structures, 35 operations, and 4,140 problem instances, organized hierarchically for fine-grained analysis of reasoning limitations. Our evaluation pipeline is fully automated and deterministic, eliminating subjective human or model-based judgments. Its synthetic nature also ensures scalability and minimizes data contamination risks. We benchmark nine state-of-the-art LLMs. Our analysis shows that instruction-tuned models struggle with basic multi-attribute and multi-hop reasoning. Furthermore, while reasoning-oriented models perform better, they remain fragile on complex and hybrid structures, with the best model achieving an average score of only 47% on the challenge subset. Crucially, models often perform poorly on multi-dimensional data and natural language task descriptions, highlighting a critical gap for real-world deployment.
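The queue example from the abstract can be made concrete: a FIFO queue preserves the temporal order of incoming customer requests, which is exactly the kind of data relationship DSR-Bench isolates. A minimal sketch (the request names are illustrative):

```python
from collections import deque

# Customer requests arrive in temporal order; a FIFO queue preserves it.
requests = deque()
for name in ["alice", "bob", "carol"]:
    requests.append(name)              # enqueue at the tail

# Serving dequeues from the head, so order of arrival is order of service.
served = [requests.popleft() for _ in range(len(requests))]
print(served)  # first-come, first-served order
```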
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' structural reasoning via data structures
Assessing multi-attribute and multi-hop reasoning limitations
Testing model performance on complex and hybrid structures
Innovation

Methods, ideas, or system contributions that make the work stand out.

DSR-Bench evaluates LLMs' structural reasoning via data structures
Automated, deterministic pipeline eliminates subjective judgments
Hierarchically organized problem instances enable fine-grained analysis
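The deterministic pipeline described above replaces human or LLM judges with an exact structural comparison. One plausible shape for such a grader, assuming the expected answer is a canonical Python-literal structure (the normalization rules here are illustrative, not the paper's):

```python
import ast

def grade(model_output: str, expected) -> bool:
    """Deterministically score a model answer by parsing its final line
    as a Python literal and comparing it to the ground-truth structure."""
    lines = model_output.strip().splitlines()
    if not lines:
        return False
    try:
        parsed = ast.literal_eval(lines[-1])   # e.g. "[3, 1, 4]"
    except (ValueError, SyntaxError):
        return False                           # unparseable answer scores 0
    return parsed == expected

# Exact structural match; chain-of-thought text before the answer is ignored.
print(grade("Reasoning...\n[3, 1, 4]", [3, 1, 4]))  # True
print(grade("[3, 1]", [3, 1, 4]))                   # False
```

Exact-match grading of this kind is what makes the evaluation subjectivity-free: a response is either the correct structure or it is not.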