Automated Benchmark Generation from Domain Guidelines Informed by Bloom's Taxonomy

πŸ“… 2026-01-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing open-ended question-answering benchmarks for evaluating large language models rely predominantly on pre-existing human-exam data, which is scarce in practice-based professional settings and limits their ability to assess contextualized reasoning. This work proposes an automated framework grounded in expert practice guidelines and Bloom's Taxonomy: it generates implicit-violation scenarios spanning four cognitive levels and expands them into reproducible, scalable multiple-choice and multi-turn dialogue benchmarks. By combining natural language generation, cognitive-level mapping, and deterministic automatic scoring, the approach provides an end-to-end path from domain-specific guidelines to structured evaluation items. Large-scale experiments across teaching, dietetics, and caregiving show that large models sometimes perform relatively better on higher-order analytical items (Analyze) while failing more often on lower-order memory-based ones (Remember), surfacing non-intuitive limitations in their contextualized reasoning.

πŸ“ Abstract
Open-ended question answering (QA) evaluates a model's ability to perform contextualized reasoning beyond factual recall. This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment, while most existing LLM benchmarks depend on pre-existing human exam datasets that are often unavailable in such settings. We introduce a framework for automated benchmark generation from expert-authored guidelines informed by Bloom's Taxonomy. It converts expert practices into implicit violation-based scenarios and expands them into auto-graded multiple-choice questions (MCQs) and multi-turn dialogues across four cognitive levels, enabling deterministic, reproducible, and scalable evaluation. Applied to three applied domains (teaching, dietetics, and caregiving), we find differences between model and human-like reasoning: LLMs sometimes perform relatively better on higher-order reasoning (Analyze) but fail more frequently on lower-level items (Remember). We produce large-scale, psychometrically informed benchmarks that surface these non-intuitive model behaviors and enable evaluation of contextualized reasoning in real-world settings.
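
To make the guideline-to-benchmark pipeline concrete, the snippet below is a minimal Python sketch: a guideline record is expanded into one implicit-violation MCQ per Bloom level and graded deterministically by exact key matching. Every name in it (Guideline, BloomLevel, generate_text, build_items, grade) and the JSON item format are hypothetical stand-ins for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a guideline -> violation-scenario -> MCQ pipeline.
# All names and the JSON item format are illustrative assumptions, not the
# authors' actual code.
import json
from dataclasses import dataclass
from enum import Enum


class BloomLevel(Enum):
    REMEMBER = "Remember"
    UNDERSTAND = "Understand"
    APPLY = "Apply"
    ANALYZE = "Analyze"


@dataclass
class Guideline:
    domain: str    # e.g. "dietetics"
    practice: str  # an expert-authored recommendation from a practice guideline


@dataclass
class MCQItem:
    level: BloomLevel
    scenario: str             # short narrative with an implicit guideline violation
    options: dict[str, str]   # option label -> option text
    answer_key: str           # correct option label; enables deterministic grading


def generate_text(prompt: str) -> str:
    """Stand-in for whatever LLM client authors the items (assumption)."""
    raise NotImplementedError("plug in an LLM API call here")


def build_items(guideline: Guideline) -> list[MCQItem]:
    """Expand one guideline into one auto-gradable MCQ per Bloom level."""
    items = []
    for level in BloomLevel:
        prompt = (
            f"Domain: {guideline.domain}\n"
            f"Guideline: {guideline.practice}\n"
            f"Write a short scenario that implicitly violates this guideline, "
            f"then a four-option multiple-choice question at the Bloom level "
            f"'{level.value}'.\n"
            'Reply as JSON: {"scenario": ..., "options": {"A": ..., "B": ..., '
            '"C": ..., "D": ...}, "answer_key": ...}'
        )
        data = json.loads(generate_text(prompt))
        items.append(
            MCQItem(level, data["scenario"], data["options"], data["answer_key"])
        )
    return items


def grade(item: MCQItem, model_choice: str) -> bool:
    """Deterministic scoring: compare the evaluated model's option label to the key."""
    return model_choice.strip().upper() == item.answer_key.strip().upper()
```

Keying each item to a single correct option label is what would make scoring deterministic and reproducible across runs, in the spirit of the auto-graded evaluation the abstract describes.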
Problem

Research questions and friction points this paper is trying to address.

open-ended question answering
contextualized reasoning
LLM evaluation
practice-based domains
benchmark generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

automated benchmark generation
Bloom's Taxonomy
contextualized reasoning
violation-based scenarios
psychometrically informed evaluation
Si Chen
University of Notre Dame
Human-Computer Interaction, AI in Education, Accessible Computing
Le Khiem
University of Notre Dame, USA
Annalisa Szymanski
University of Notre Dame, USA
Ronald A. Metoyer
University of Notre Dame, USA
Ting Hua
University of Notre Dame
Efficient Learning, Compression, Reasoning
Nitesh V. Chawla
University of Notre Dame, USA