🤖 AI Summary
This study addresses the need for pedagogically appropriate reading comprehension materials in morphologically rich languages—specifically Portuguese—for elementary education. We propose a generative AI–driven framework for automatically constructing and psychometrically evaluating narrative-oriented, multi-level multiple-choice questions (MCQs). Methodologically, the framework integrates large language model–based item generation, narrative structure annotation, linguistic-feature–informed difficulty quantification, expert human evaluation, and Item Response Theory (IRT) analysis, validated using real student response data. Contributions include: (1) the first systematic reliability assessment of generative AI–produced educational MCQs in morphologically complex languages; (2) a novel dual-constraint generation mechanism jointly guided by narrative dimensions and difficulty stratification; and (3) empirical evidence demonstrating that generated items exhibit acceptable discrimination and difficulty parameters—comparable to human-authored items—though further refinement is needed in semantic clarity and distractor design.
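The summary does not state which IRT model underlies the reported discrimination and difficulty parameters; as an illustration only, the widely used two-parameter logistic (2PL) model relates a student's latent ability $\theta$ to the probability of answering item $i$ correctly:

$$
P(X_i = 1 \mid \theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}
$$

Here $a_i$ is the item's discrimination and $b_i$ its difficulty, the two parameters the study uses to judge whether generated items behave comparably to human-authored ones. The specific model form and any acceptance thresholds are assumptions for illustration, not details taken from the paper.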
📝 Abstract
While MCQs are valuable for learning and evaluation, manually creating them with varying difficulty levels and targeted reading skills remains a time-consuming and costly task. Recent advances in generative AI provide an opportunity to automate MCQ generation efficiently. However, assessing the actual quality and reliability of generated MCQs has received limited attention, particularly regarding cases where generation fails. This aspect becomes especially important when the generated MCQs are meant to be applied in real-world settings. Additionally, most MCQ generation studies focus on English, leaving other languages underexplored. This paper investigates the capabilities of current generative models in producing MCQs for reading comprehension in Portuguese, a morphologically rich language. Our study focuses on generating MCQs that align with curriculum-relevant narrative elements and span different difficulty levels. We evaluate these MCQs through expert review and by analyzing the psychometric properties extracted from student responses to assess their suitability for elementary school students. Our results show that current models can generate MCQs of quality comparable to human-authored ones. However, we identify issues related to semantic clarity and answerability, and challenges remain in generating distractors that engage students and meet established criteria for high-quality MCQ option design.