🤖 AI Summary
This study investigates the capabilities of large language models (LLMs) in text-driven question generation and how their output differs from human-authored questions. To this end, we propose the first systematic, multi-dimensional quantitative evaluation framework, which automatically assesses LLM-generated questions along six dimensions: question length, question-type distribution, context coverage, answerability, linguistic fluency, and semantic fidelity. We further introduce a novel LLM-based self-evaluation mechanism that requires no human annotation and enables end-to-end quality assessment of generated questions. Experimental results reveal a fundamental trade-off in LLM-generated questions between breadth of context coverage and answer precision, a pattern markedly different from human question-authoring behavior. The framework establishes a transferable analytical paradigm for question quality assessment and offers actionable insights for downstream applications such as question-answering systems and intelligent educational assessment tools.
📝 Abstract
This paper evaluates questions generated by LLMs from context, comparing them to human-generated questions across six dimensions. We introduce an automated LLM-based evaluation method, focusing on aspects like question length, type, context coverage, and answerability. Our findings highlight unique characteristics of LLM-generated questions, contributing insights that can support further research in question quality and downstream applications.
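To make the six dimensions concrete, here is a minimal sketch, not the authors' implementation, of how such metrics could be computed. Question length, type distribution, and context coverage use simple lexical heuristics; the LLM-judged dimensions (answerability, fluency, semantic fidelity) are stubbed behind a hypothetical `llm_judge` callable that the caller would supply, for example a wrapper around any chat-completion API.

```python
# Sketch of a multi-dimensional question-evaluation pipeline (illustrative only).
from collections import Counter
from typing import Callable, Iterable

WH_WORDS = ("what", "who", "whom", "whose", "where", "when", "why", "which", "how")

def question_type(question: str) -> str:
    """Classify a question by its leading wh-word; anything else is 'other' (e.g. yes/no)."""
    tokens = question.strip().lower().split()
    return tokens[0] if tokens and tokens[0] in WH_WORDS else "other"

def type_distribution(questions: Iterable[str]) -> dict:
    """Relative frequency of each question type over the generated set."""
    counts = Counter(question_type(q) for q in questions)
    total = sum(counts.values()) or 1
    return {t: c / total for t, c in counts.items()}

def avg_length(questions: list) -> float:
    """Mean question length in whitespace tokens."""
    return sum(len(q.split()) for q in questions) / max(len(questions), 1)

def context_coverage(context: str, questions: list) -> float:
    """Crude lexical proxy for coverage: share of distinct context tokens
    that appear in at least one generated question."""
    ctx_tokens = set(context.lower().split())
    q_tokens = set(" ".join(questions).lower().split())
    return len(ctx_tokens & q_tokens) / max(len(ctx_tokens), 1)

def llm_scores(context: str, questions: list,
               llm_judge: Callable[[str], float]) -> dict:
    """Average rubric scores from an LLM judge (assumed to map a prompt to a
    numeric score, e.g. 1-5); `llm_judge` is a hypothetical user-supplied hook."""
    rubrics = {
        "answerability": "Can the question be answered from the context alone?",
        "fluency": "Is the question grammatically fluent?",
        "semantic_fidelity": "Is the question faithful to the context's meaning?",
    }
    scores = {}
    for dim, rubric in rubrics.items():
        prompts = [f"Context:\n{context}\n\nQuestion: {q}\n\n{rubric} Score 1-5."
                   for q in questions]
        scores[dim] = sum(llm_judge(p) for p in prompts) / max(len(prompts), 1)
    return scores

if __name__ == "__main__":
    ctx = "The Amazon rainforest produces roughly 20 percent of the world's oxygen."
    qs = ["What does the Amazon rainforest produce?",
          "How much of the world's oxygen comes from the Amazon?"]
    print(avg_length(qs), type_distribution(qs), round(context_coverage(ctx, qs), 2))
```

The heuristic metrics above stand in for whatever tokenization and coverage definitions the paper actually uses; only the overall structure (surface metrics plus an annotation-free LLM judge) mirrors the framework described in the summary.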