EduVidQA: Generating and Evaluating Long-form Answers to Student Questions based on Lecture Videos

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of automatically answering open-ended questions posed by students about online instructional videos using multimodal large language models (MLLMs), aiming to enhance interactivity and learning efficacy in digital education platforms. Methodologically, we propose a hybrid fine-tuning strategy combining synthetic and real-world data, and conduct systematic benchmarking across six state-of-the-art MLLMs. We introduce EduVidQA—the first long-form generative dataset for instructional video question answering—comprising 5,252 high-quality Q&A pairs, along with a student-preference-driven qualitative evaluation framework. Results show that synthetic data substantially improves models’ capacity for generating lengthy, coherent answers; however, the task remains highly challenging, with existing MLLMs exhibiting notable deficiencies in factual consistency, pedagogical appropriateness, and structural completeness. This work establishes a new benchmark, dataset, and evaluation paradigm for educational NLP.

📝 Abstract
As digital platforms redefine educational paradigms, ensuring interactivity remains vital for effective learning. This paper explores using Multimodal Large Language Models (MLLMs) to automatically respond to student questions from online lectures, a novel question answering task of real-world significance. We introduce the EduVidQA Dataset with 5,252 question-answer pairs (both synthetic and real-world) from 296 computer science videos covering diverse topics and difficulty levels. To understand the needs of the dataset and task evaluation, we empirically study the qualitative preferences of students, which we provide as an important contribution to this line of work. Our benchmarking experiments cover 6 state-of-the-art MLLMs, through which we study the effectiveness of our synthetic data for fine-tuning and demonstrate the challenging nature of the task. We evaluate the models using both text-based and qualitative metrics, providing a nuanced perspective on model performance that is paramount for future work. This work not only sets a benchmark for this important problem, but also opens exciting avenues for future research in the field of Natural Language Processing for Education.
Problem

Research questions and friction points this paper is trying to address.

Automating student question answering from lecture videos using multimodal models
Creating and evaluating long-form educational answers with synthetic datasets
Benchmarking model performance for educational natural language processing tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLMs generate answers from lecture videos
Synthetic data enhances model fine-tuning process
Multimodal evaluation combines text and qualitative metrics
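To make the "text-based metrics" side of this evaluation concrete, here is a minimal, hypothetical sketch of one such metric: token-level F1 between a generated long-form answer and a reference answer. This is an illustration of the general approach, not the paper's actual evaluation code or metric choice.

```python
# Hypothetical sketch: token-level F1 overlap, a simple text-based
# metric for scoring a generated long-form answer against a reference.
# This is NOT the paper's actual metric, just an illustrative example.
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most
    # min(count_in_pred, count_in_ref) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy example of scoring a model answer against a reference answer.
generated = "A hash table stores key value pairs for fast lookup"
reference = "Hash tables store key value pairs enabling fast lookup"
score = token_f1(generated, reference)
```

Lexical metrics like this are known to correlate poorly with pedagogical quality for long-form answers, which is precisely why the paper pairs them with a student-preference-driven qualitative evaluation.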
Sourjyadip Ray
Indian Institute of Technology, Kharagpur
Shubham Sharma
Panjab University, Chandigarh
Somak Aditya
Assistant Professor, IIT Kharagpur
Knowledge Representation, Commonsense Reasoning, Natural Language Processing, Natural Language Understanding, Visual Reasoning
Pawan Goyal
Indian Institute of Technology, Kharagpur