🤖 AI Summary
This work addresses the challenge of automatically answering open-ended student questions about online instructional videos using multimodal large language models (MLLMs), aiming to enhance interactivity and learning efficacy on digital education platforms. Methodologically, we propose a hybrid fine-tuning strategy combining synthetic and real-world data and conduct systematic benchmarking across six state-of-the-art MLLMs. We introduce EduVidQA, the first long-form generative dataset for instructional video question answering, comprising 5,252 high-quality Q&A pairs, along with a student-preference-driven qualitative evaluation framework. Results show that synthetic data substantially improves models' capacity for generating long, coherent answers; however, the task remains highly challenging, with existing MLLMs exhibiting notable deficiencies in factual consistency, pedagogical appropriateness, and structural completeness. This work establishes a new benchmark, dataset, and evaluation paradigm for educational NLP.
📝 Abstract
As digital platforms redefine educational paradigms, ensuring interactivity remains vital for effective learning. This paper explores the use of Multimodal Large Language Models (MLLMs) to automatically answer student questions about online lectures, a novel question-answering task of real-world significance. We introduce the EduVidQA dataset, comprising 5,252 question-answer pairs (both synthetic and real-world) drawn from 296 computer science videos spanning diverse topics and difficulty levels. To ground the dataset design and task evaluation, we empirically study students' qualitative preferences, which we provide as an important contribution to this line of work. Our benchmarking experiments cover six state-of-the-art MLLMs, through which we study the effectiveness of our synthetic data for finetuning and demonstrate the challenging nature of the task. We evaluate the models using both text-based and qualitative metrics, offering a nuanced perspective on their performance that is paramount for future work. This work not only sets a benchmark for this important problem but also opens exciting avenues for future research in Natural Language Processing for Education.