Exploring Automated Recognition of Instructional Activity and Discourse from Multimodal Classroom Data

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the high cost and limited scalability of manual annotation for classroom interactions by proposing a multimodal AI framework for fine-grained, automated recognition of teaching activities (24 classes) and teacher-student discourse (19 classes). Methodologically, it introduces parallel, modality-specific pipelines for video and text processing, integrating contextual window modeling, class-balanced sampling, and multi-label threshold optimization. The framework employs fine-tuned vision-language models, self-supervised video Transformers, and contextualized Transformer classifiers, benchmarked against zero-shot large language model prompting. Results show that fine-tuned models significantly outperform prompting: macro-F1 scores reach 0.577 (video) and 0.460 (text), demonstrating the feasibility of scalable, automated classroom feedback systems. The core contribution is the first end-to-end, multimodal, fine-grained recognition framework tailored to instructional settings, coupled with robust training strategies for imbalanced, multimodal educational data.
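The "contextual window modeling" for transcripts can be pictured as concatenating each utterance with its neighbors before classification, so the discourse classifier sees local dialogue context. A minimal sketch, assuming a window of `k` utterances on each side and a `[SEP]`-style separator (both the window size and the separator are illustrative choices, not the paper's exact configuration):

```python
def build_context_windows(utterances, k=2, sep=" [SEP] "):
    """Wrap each utterance with up to k previous and k next utterances.

    utterances: list of transcript turns in order.
    Returns one contextualized input string per utterance.
    """
    windows = []
    for i, utterance in enumerate(utterances):
        left = utterances[max(0, i - k):i]          # preceding context
        right = utterances[i + 1:i + 1 + k]         # following context
        windows.append(sep.join(left + [utterance] + right))
    return windows
```

Each contextualized string would then be fed to the transcript classifier in place of the bare utterance.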

📝 Abstract
Observation of classroom interactions can provide concrete feedback to teachers, but current methods rely on manual annotation, which is resource-intensive and hard to scale. This work explores AI-driven analysis of classroom recordings, focusing on multimodal instructional activity and discourse recognition as a foundation for actionable feedback. Using a densely annotated dataset of 164 hours of video and 68 lesson transcripts, we design parallel, modality-specific pipelines. For video, we evaluate zero-shot multimodal LLMs, fine-tuned vision-language models, and self-supervised video transformers on 24 activity labels. For transcripts, we fine-tune a transformer-based classifier with contextualized inputs and compare it against prompting-based LLMs on 19 discourse labels. To handle class imbalance and multi-label complexity, we apply per-label thresholding, context windows, and imbalance-aware loss functions. The results show that fine-tuned models consistently outperform prompting-based approaches, achieving macro-F1 scores of 0.577 for video and 0.460 for transcripts. These results demonstrate the feasibility of automated classroom analysis and establish a foundation for scalable teacher feedback systems.
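The per-label thresholding described in the abstract can be approximated by a simple grid search that picks, for each label in the multi-label output, the decision threshold maximizing F1 on a validation set. A hedged sketch (the function name, threshold grid, and F1 objective are assumptions, not the authors' exact procedure):

```python
import numpy as np

def tune_per_label_thresholds(probs, labels, grid=np.linspace(0.05, 0.95, 19)):
    """Pick, per label, the threshold that maximizes validation F1.

    probs:  [n_samples, n_labels] predicted probabilities.
    labels: [n_samples, n_labels] binary ground truth.
    Returns an array of one threshold per label.
    """
    n_labels = probs.shape[1]
    thresholds = np.full(n_labels, 0.5)
    for j in range(n_labels):
        best_f1 = -1.0
        for t in grid:
            pred = probs[:, j] >= t
            tp = np.sum(pred & (labels[:, j] == 1))
            fp = np.sum(pred & (labels[:, j] == 0))
            fn = np.sum(~pred & (labels[:, j] == 1))
            denom = 2 * tp + fp + fn
            f1 = 2 * tp / denom if denom > 0 else 0.0
            if f1 > best_f1:
                best_f1, thresholds[j] = f1, t
    return thresholds
```

At inference time, each label's probability is compared against its own tuned threshold instead of a global 0.5, which typically helps rare labels in imbalanced multi-label settings.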
Problem

Research questions and friction points this paper is trying to address.

Automates recognition of instructional activities from classroom video
Automates recognition of discourse patterns from classroom transcripts
Establishes scalable AI foundation for teacher feedback systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned models outperform prompting-based LLMs
Parallel pipelines process video and transcript data separately
Imbalance-aware techniques handle multi-label classification complexity
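The imbalance-aware loss mentioned above could take the form of a positive-class-weighted binary cross-entropy, where rarer labels receive larger weights. The paper's exact loss is not specified here, so this is a generic sketch; the function names and the neg/pos weighting rule are illustrative assumptions:

```python
import numpy as np

def pos_weights_from_labels(labels):
    """Weight each label's positive class by its negative/positive ratio."""
    pos = labels.sum(axis=0)
    neg = labels.shape[0] - pos
    return neg / np.maximum(pos, 1)

def weighted_bce(logits, targets, pos_weight):
    """Per-label weighted binary cross-entropy, computed from logits.

    logits, targets: [n_samples, n_labels]; pos_weight: [n_labels].
    Uses log-sigmoid via logaddexp for numerical stability.
    """
    log_p = -np.logaddexp(0.0, -logits)     # log sigmoid(x)
    log_1mp = -np.logaddexp(0.0, logits)    # log (1 - sigmoid(x))
    loss = -(pos_weight * targets * log_p + (1 - targets) * log_1mp)
    return loss.mean()
```

In a PyTorch training loop, the same effect is commonly achieved by passing the per-label weights as `pos_weight` to `torch.nn.BCEWithLogitsLoss`.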