AI Summary
Existing open-domain question-answering evaluation benchmarks predominantly rely on scarce human-exam data, limiting their ability to assess large language models' contextual reasoning in authentic professional settings. This work proposes an automated framework grounded in expert practice guidelines and Bloom's taxonomy, which generates implicit-violation scenarios spanning four cognitive levels to construct reproducible and scalable multiple-choice and multi-turn dialogue benchmarks. By integrating natural language generation, cognitive-level mapping, and automatic scoring, the approach enables an end-to-end transformation of domain-specific guidelines into structured evaluation items. Large-scale experiments across the education, nutrition, and caregiving domains show that large models can perform relatively better on higher-order analytical tasks than on lower-order memory-based ones, uncovering non-intuitive limitations in their contextual reasoning capabilities.
Abstract
Open-ended question answering (QA) evaluates a model's ability to perform contextualized reasoning beyond factual recall. This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment, yet most existing LLM benchmarks depend on pre-existing human exam datasets that are often unavailable in such settings. We introduce a framework for automated benchmark generation from expert-authored guidelines, informed by Bloom's Taxonomy. It converts expert practices into implicit violation-based scenarios and expands them into auto-graded multiple-choice questions (MCQs) and multi-turn dialogues across four cognitive levels, enabling deterministic, reproducible, and scalable evaluation. Applying the framework to three practice-based domains (teaching, dietetics, and caregiving), we find differences between model and human-like reasoning: LLMs sometimes perform relatively better on higher-order reasoning items (Analyze) but fail more frequently on lower-level ones (Remember). We produce large-scale, psychometrically informed benchmarks that surface these non-intuitive model behaviors and enable evaluation of contextualized reasoning in real-world settings.