AI Summary
Existing open-domain question-answering evaluation benchmarks predominantly rely on scarce human-exam data, limiting their ability to assess large language models' contextual reasoning in authentic professional settings. This work proposes an automated framework grounded in expert practice guidelines and Bloom's taxonomy, which generates implicit-violation scenarios spanning four cognitive levels to construct reproducible and scalable multiple-choice and multi-turn dialogue benchmarks. By integrating natural language generation, cognitive-level mapping, and automatic scoring, the approach enables an end-to-end transformation of domain-specific guidelines into structured evaluation items. Large-scale experiments across the education, nutrition, and caregiving domains show that large models can perform relatively better on higher-order analytical tasks than on lower-order memory-based ones, uncovering non-intuitive limitations in their contextual reasoning capabilities.
Abstract
Open-ended question answering (QA) evaluates a model's ability to perform contextualized reasoning beyond factual recall. This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment, yet most existing LLM benchmarks depend on pre-existing human exam datasets that are often unavailable in such settings. We introduce a framework for automated benchmark generation from expert-authored guidelines, informed by Bloom's Taxonomy. It converts expert practices into implicit violation-based scenarios and expands them into auto-graded multiple-choice questions (MCQs) and multi-turn dialogues across four cognitive levels, enabling deterministic, reproducible, and scalable evaluation. Applying the framework to three practice-based domains (teaching, dietetics, and caregiving), we find differences between model and human-like reasoning: LLMs sometimes perform relatively better on higher-order reasoning items (Analyze) but fail more frequently on lower-level ones (Remember). We produce large-scale, psychometrically informed benchmarks that surface these non-intuitive model behaviors and enable evaluation of contextualized reasoning in real-world settings.