🤖 AI Summary
This study addresses the challenge of real-time, fine-grained assessment of multidimensional learner engagement (attentional, emotional, and cognitive) in self-directed video learning. The authors propose a multimodal engagement estimation system that integrates physiological and behavioral signals from wearable sensors (PPG, ECG, EDA, EEG, IMU) and camera-based eye tracking, with labels drawn from momentary self-report probes. Key contributions include the release of EduGage, the first synchronously aligned multimodal dataset for learning engagement, and empirical validation of lightweight multimodal fusion for fine-grained modeling. Under participant-based cross-validation, the proposed model achieves a mean absolute error (MAE) of 0.81, a within-1 accuracy of 83.75%, and a binary classification accuracy of 73.93%, outperforming sensor-free, statistical, deep temporal, foundation-model, and LLM-based baselines.
📝 Abstract
Engagement, which links to attentional, emotional, and cognitive dimensions, plays an important role in learning. In online and video-based learning environments, learners often need to regulate their own interactions with instructional materials. Measuring and reflecting on engagement can therefore support both learners and adaptive learning systems. In this study, we use wearable and camera-based sensing devices to collect physiological and motion signals, including PPG, ECG, EDA, EEG, IMU, heart rate, temperature, and eye-tracking data, to estimate learner engagement. We conducted a user study with 16 participants in a video-based learning scenario, where participants completed learning tasks and provided repeated in-situ self-reports of engagement through brief probes. We develop and evaluate a system for engagement estimation, compare different sensing modalities, and further analyze the feasibility and effectiveness of multimodal modeling for characterizing learner engagement. Across participant-based cross-validation, our model achieves an MAE of 0.81, 83.75% within-1 accuracy, 73.93% binary accuracy, and 68.45% binary Macro-F1, outperforming sensor-free, statistical, deep temporal, foundation-model, and LLM-based baselines. Our results suggest that fine-grained engagement estimation is feasible but inherently noisy, and that practical systems should prioritize lightweight combinations of behavioral and physiological signals over full multimodal instrumentation. We release the EduGage dataset, including synchronized multimodal sensor signals, probe-aligned momentary engagement labels, video metadata, quizzes, and study materials, to support reproducible research on fine-grained sensor-based engagement modeling in self-guided learning.
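The reported metrics (MAE, within-1 accuracy, binary accuracy, binary Macro-F1) and the participant-based evaluation can be illustrated with a minimal sketch. This is not the authors' code. The abstract does not state the rating scale, the binarization threshold, or the exact fold scheme, so the 1-to-5 scale, the threshold of 3, and the leave-one-participant-out split below are assumptions made purely for illustration, and all function names are hypothetical.

```python
from collections import defaultdict


def mae(y_true, y_pred):
    """Mean absolute error between self-reported and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions at most one scale point away from the label."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def binarize(scores, threshold):
    """Map scores to 1 (engaged) if >= threshold, else 0.

    The threshold is an assumption; the abstract does not specify one.
    """
    return [int(s >= threshold) for s in scores]


def binary_accuracy(b_true, b_pred):
    """Fraction of binary predictions that match the binary labels."""
    return sum(t == p for t, p in zip(b_true, b_pred)) / len(b_true)


def macro_f1(b_true, b_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(b_true, b_pred))
        fp = sum(t != c and p == c for t, p in zip(b_true, b_pred))
        fn = sum(t == c and p != c for t, p in zip(b_true, b_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def participant_folds(participant_ids):
    """Yield (held_out_id, train_idx, test_idx), one fold per participant.

    Leave-one-participant-out is assumed here; every sample from the
    held-out participant goes to the test set, so no person appears in
    both training and test data.
    """
    by_pid = defaultdict(list)
    for i, pid in enumerate(participant_ids):
        by_pid[pid].append(i)
    for pid, test_idx in by_pid.items():
        train_idx = [i for i, other in enumerate(participant_ids) if other != pid]
        yield pid, train_idx, test_idx


# Toy data on an assumed 1-to-5 engagement scale (not from EduGage).
y_true = [1, 2, 3, 4, 5, 3]
y_pred = [2, 3, 5, 4, 2, 3]
pids = ["p1", "p1", "p2", "p2", "p3", "p3"]

b_true = binarize(y_true, threshold=3)
b_pred = binarize(y_pred, threshold=3)

print("MAE:", mae(y_true, y_pred))
print("within-1:", within_one_accuracy(y_true, y_pred))
print("binary acc:", binary_accuracy(b_true, b_pred))
print("macro-F1:", macro_f1(b_true, b_pred))
for held_out, train_idx, test_idx in participant_folds(pids):
    print(held_out, train_idx, test_idx)
```

Splitting by participant rather than by sample matters for this kind of data: repeated probes from one person are strongly correlated, so a random sample-level split would let a model see the test participant during training and overstate how well it generalizes to new learners.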