🤖 AI Summary
This work addresses the challenge of automatically answering open-ended student questions about online instructional videos using multimodal large language models (MLLMs), aiming to enhance interactivity and learning efficacy on digital education platforms. Methodologically, we propose a hybrid fine-tuning strategy combining synthetic and real-world data and conduct systematic benchmarking across six state-of-the-art MLLMs. We introduce EduVidQA, the first long-form generative dataset for instructional video question answering, comprising 5,252 high-quality Q&A pairs, along with a student-preference-driven qualitative evaluation framework. Results show that synthetic data substantially improves models' capacity for generating long, coherent answers; however, the task remains highly challenging, with existing MLLMs exhibiting notable deficiencies in factual consistency, pedagogical appropriateness, and structural completeness. This work establishes a new benchmark, dataset, and evaluation paradigm for educational NLP.
📝 Abstract
As digital platforms redefine educational paradigms, ensuring interactivity remains vital for effective learning. This paper explores the use of Multimodal Large Language Models (MLLMs) to automatically answer student questions about online lectures, a novel question-answering task of real-world significance. We introduce the EduVidQA dataset, comprising 5,252 question-answer pairs (both synthetic and real-world) drawn from 296 computer science videos spanning diverse topics and difficulty levels. To ground the dataset design and task evaluation, we empirically study students' qualitative preferences, which we provide as an important contribution to this line of work. Our benchmarking experiments cover six state-of-the-art MLLMs, through which we study the effectiveness of our synthetic data for finetuning and demonstrate the challenging nature of the task. We evaluate the models using both text-based and qualitative metrics, offering a nuanced perspective on their performance that is paramount for future work. This work not only sets a benchmark for this important problem but also opens exciting avenues for future research in Natural Language Processing for Education.