DEEPQUESTION: Systematic Generation of Real-World Challenges for Evaluating LLM Performance

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
While large language models (LLMs) achieve strong performance on standard benchmarks, they show severe limitations on real-world higher-order tasks that demand deep reasoning and creative thinking. Method: We propose DeepQuestion, a framework that hierarchically augments existing datasets using Bloom's taxonomy, designs novel questions with traceable solution paths, and introduces the first automated evaluation paradigm to integrate educational cognitive theory with path-inversion algorithms. Contribution/Results: Extensive evaluation of 10 mainstream open- and closed-weight LLMs reveals accuracy drops of up to 70% on evaluative and creative (i.e., highest-level) cognitive tasks, exposing fundamental bottlenecks in deep reasoning. This work establishes a new standard for cognitively diverse evaluation and releases an open-source benchmark dataset to support reproducible, theory-informed LLM assessment.

📝 Abstract
LLMs often excel on standard benchmarks but falter on real-world tasks. We introduce DeepQuestion, a scalable automated framework that augments existing datasets based on Bloom's taxonomy and creates novel questions that trace original solution paths to probe evaluative and creative skills. Extensive experiments across ten open-source and proprietary models, covering both general-purpose and reasoning LLMs, reveal substantial performance drops (even up to 70% accuracy loss) on higher-order tasks, underscoring persistent gaps in deep reasoning. Our work highlights the need for cognitively diverse benchmarks to advance LLM progress. DeepQuestion and related datasets will be released upon acceptance of the paper.
Problem

Research questions and friction points this paper is trying to address.

Measuring the gap between LLMs' benchmark scores and real-world task performance
Automating generation of diverse questions that assess deeper reasoning
Diagnosing accuracy drops on higher-order cognitive tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable automated framework for dataset augmentation
Novel question generation grounded in Bloom's taxonomy
Solution-path tracing and inversion to probe deep reasoning
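The hierarchical augmentation idea can be sketched as a mapping from taxonomy levels to question templates. The level names below follow Bloom's taxonomy, but the templates, function names, and prompt wording are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of Bloom's-taxonomy-guided question augmentation.
# The six level names are Bloom's; everything else is an illustrative
# assumption, not DeepQuestion's real pipeline.

BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

TEMPLATES = {
    "remember":   "State the key fact needed to answer: {q}",
    "understand": "Explain in your own words what is being asked: {q}",
    "apply":      "Apply the relevant method to solve: {q}",
    "analyze":    "Break the problem into sub-steps and justify each one: {q}",
    "evaluate":   "Judge whether each step of a proposed solution to this problem is valid: {q}",
    "create":     "Design a new problem whose solution reuses the same reasoning path as: {q}",
}

def augment(question: str, level: str) -> str:
    """Rewrite a seed question into a variant targeting one taxonomy level."""
    if level not in TEMPLATES:
        raise ValueError(f"unknown Bloom's level: {level}")
    return TEMPLATES[level].format(q=question)

def augment_all(question: str) -> dict[str, str]:
    """Produce one variant of the seed question per taxonomy level."""
    return {level: augment(question, level) for level in BLOOM_LEVELS}
```

In this sketch, the "evaluate" and "create" templates correspond to the highest-level tasks on which the paper reports the largest accuracy drops; a real system would generate variants with an LLM rather than static templates.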