MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback

πŸ“… 2024-10-17
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Large language models (LLMs) suffer from outdated knowledge, hallucination, and prompt sensitivity when generating professional medical multiple-choice questions (e.g., for the USMLE). Method: We propose an expert-guided iterative self-refinement framework that integrates clinical-case-driven expert prompt engineering, multi-round LLM self-critique and self-correction, comparison-based feedback, and an LLM-as-Judge automatic evaluation metric. Contribution/Results: This is the first framework to enable end-to-end, expert-aligned question generation with quantitative assessment. It significantly improves question quality and difficulty calibration: human expert satisfaction increases markedly, automated scores agree closely with expert ratings (Spearman ρ > 0.92), and question pass rates rise by 37%.
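
The critique-and-correction loop at the heart of the method can be pictured as follows. This is a minimal sketch, assuming a generic `chat(prompt)` helper for LLM calls and illustrative placeholder prompts; it is not the paper's actual implementation or prompt wording.

```python
# Minimal sketch of an iterative self-critique / self-correction loop.
# `chat` is a hypothetical helper wrapping an LLM API call; the prompt
# templates below are illustrative placeholders, not the paper's prompts.

def chat(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def generate_mcq(clinical_case: str, max_rounds: int = 3) -> str:
    # Initial draft: expert-style item-writing instructions plus the case.
    question = chat(
        "You are a USMLE item writer. Write one multiple-choice question "
        f"(stem, 5 options, answer, explanation) from this case:\n{clinical_case}"
    )
    for _ in range(max_rounds):
        # Self-critique: ask the model to review its own draft.
        critique = chat(
            "Critique this USMLE question for medical accuracy, difficulty, "
            f"and item-writing flaws:\n{question}"
        )
        if "no issues" in critique.lower():
            break  # stop early once the critique finds nothing to fix
        # Self-correction: revise the draft using the critique as feedback.
        question = chat(
            "Revise the question to address this critique.\n"
            f"Question:\n{question}\nCritique:\n{critique}"
        )
    return question
```

Capping the loop at a fixed number of rounds and stopping early on a clean critique keeps the cost of repeated LLM calls bounded.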

πŸ“ Abstract
Automatic question generation (QG) is essential for AI and NLP, particularly in intelligent tutoring, dialogue systems, and fact verification. Generating multiple-choice questions (MCQG) for professional exams, like the United States Medical Licensing Examination (USMLE), is particularly challenging, requiring domain expertise and complex multi-hop reasoning for high-quality questions. However, current large language models (LLMs) like GPT-4 struggle with professional MCQG due to outdated knowledge, hallucination issues, and prompt sensitivity, resulting in unsatisfactory quality and difficulty. To address these challenges, we propose MCQG-SRefine, an LLM self-refine-based (Critique and Correction) framework for converting medical cases into high-quality USMLE-style questions. By integrating expert-driven prompt engineering with iterative self-critique and self-correction feedback, MCQG-SRefine significantly enhances human expert satisfaction regarding both the quality and difficulty of the questions. Furthermore, we introduce an LLM-as-Judge-based automatic metric to replace the complex and costly expert evaluation process, ensuring reliable and expert-aligned assessments.
Problem

Research questions and friction points this paper is trying to address.

Enhance multiple-choice question generation
Improve AI accuracy on medical exams
Reduce expert evaluation costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM self-refine framework
Expert-driven prompt engineering
LLM-as-Judge automatic metric
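
As a rough illustration of how an LLM-as-Judge score can be validated against expert ratings, the sketch below grades questions with a hypothetical judge prompt and measures rank agreement with Spearman's ρ via `scipy.stats.spearmanr`. The rubric and prompt are assumptions, not the paper's exact criteria; only the Spearman computation is standard.

```python
# Sketch: LLM-as-Judge scoring plus an agreement check with expert ratings.
# The judge prompt and 1-10 rubric are illustrative assumptions; the
# Spearman correlation (scipy.stats.spearmanr) is a standard library call.
from scipy.stats import spearmanr

def judge_score(question: str, chat) -> float:
    """Ask an LLM to grade a question from 1 to 10 on a simple rubric."""
    reply = chat(
        "Rate this USMLE-style question from 1 to 10 for quality and "
        f"difficulty. Reply with the number only.\n{question}"
    )
    return float(reply.strip())

def expert_alignment(questions, expert_ratings, chat) -> float:
    """Spearman rank correlation between judge scores and expert ratings."""
    auto_scores = [judge_score(q, chat) for q in questions]
    rho, _ = spearmanr(auto_scores, expert_ratings)
    return rho
```

A high ρ on a held-out set of expert-rated questions is what justifies replacing the costly expert evaluation with the automatic judge.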