🤖 AI Summary
Automated assessment of question quality in education suffers from low accuracy, limited evaluation dimensions, and poor agreement with human judgments. To address these challenges, this paper proposes a two-stage iterative evaluation framework built on collaborating large language models (LLMs). In Stage I, structured prompting orchestrates specialized agents to produce multidimensional analyses covering aspects such as relevance, appropriateness, and cognitive demand. In Stage II, a dynamic feedback mechanism iteratively refines the evaluations until the quality metrics converge. The framework introduces a "reason-and-refine" paradigm that integrates multi-model cross-verification, adaptive optimization, and comprehensive multidimensional quality modeling. Experiments show significant improvements over baseline methods: Pearson correlation with human scores increases by 23.6%, error distributions become markedly more concentrated, and robustness to input perturbations is substantially improved.
📝 Abstract
Automatically assessing question quality is valuable for educators: it saves time, ensures consistency, and provides immediate feedback for refining teaching materials. We propose STRIVE (Structured Thinking and Refinement with multiLLMs for Improving Verified Question Estimation), a methodology that uses a series of Large Language Models (LLMs) for automatic question evaluation. The approach aims to improve both the accuracy and the depth of question quality assessment, ultimately supporting diverse learners and enhancing educational practice. STRIVE first generates multiple evaluations analyzing the strengths and weaknesses of a given question and selects the best one produced by the LLM; a second LLM then iteratively reviews and revises this evaluation until the metric values converge. Correlation scores show that the proposed method agrees more closely with human judgments than the baseline method, and error analysis shows that metrics such as relevance and appropriateness improve significantly relative to human judgments when STRIVE is used.
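The generate-select-refine loop described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the functions `generate_evaluations` and `refine` are hypothetical stand-ins for the actual LLM calls (Stage I candidate generation and Stage II reviewer feedback), and the three score dimensions and convergence tolerance are assumptions for the sake of the example.

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    """One candidate quality assessment along assumed dimensions."""
    relevance: float
    appropriateness: float
    cognitive_demand: float

    def score(self) -> float:
        # Aggregate the dimensions into a single quality estimate.
        return (self.relevance + self.appropriateness + self.cognitive_demand) / 3

def generate_evaluations(question: str, n: int = 3) -> list[Evaluation]:
    # Stage I stub: in STRIVE this would prompt LLMs for several
    # strength/weakness analyses of the question. Fixed numbers here
    # stand in for model outputs.
    return [Evaluation(0.6 + 0.1 * i, 0.5 + 0.1 * i, 0.4 + 0.1 * i) for i in range(n)]

def refine(ev: Evaluation, step: float = 0.05) -> Evaluation:
    # Stage II stub: a reviewer LLM would critique the evaluation and
    # adjust each dimension; here we nudge scores upward, capped at 1.0.
    return Evaluation(
        min(1.0, ev.relevance + step),
        min(1.0, ev.appropriateness + step),
        min(1.0, ev.cognitive_demand + step),
    )

def strive(question: str, tol: float = 1e-3, max_rounds: int = 10) -> Evaluation:
    # Generate candidates, keep the best, then refine until the
    # aggregate score stops changing (convergence) or rounds run out.
    best = max(generate_evaluations(question), key=lambda e: e.score())
    prev = best.score()
    for _ in range(max_rounds):
        best = refine(best)
        cur = best.score()
        if abs(cur - prev) < tol:
            break
        prev = cur
    return best
```

With these stubs, repeated refinement saturates every dimension at 1.0, at which point consecutive scores stop changing and the loop terminates; in the real system, convergence would instead reflect the reviewer LLM's feedback stabilizing.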