🤖 AI Summary
This work addresses the challenge of accurately predicting student collaboration satisfaction in game-based learning, where individual modalities such as eye gaze can degrade and are insufficient on their own. The authors propose the Affinity-Aligned Multimodal Learning Analytics (AAMLA) framework, built around a Cross-modal Affinity-guided Modality Alignment (CAMA) module. CAMA explicitly models inter-modal relationships through affinity matrices and enforces cross-modal consistency through contrastive learning, while modality-specific projection layers map heterogeneous features (facial action units, head pose, eye gaze, and interaction trace logs) into a unified semantic space before alignment. The key innovation is that affinity guidance adaptively suppresses uninformative modalities without discarding them, which is intended to improve robustness and interpretability. Evaluated on 50 middle school students in the EcoJourneys environment, AAMLA outperforms unimodal baselines and prior cross-attention approaches under both standard and modality-degraded conditions, and SHAP and t-SNE analyses are reported to confirm that the learned representations are robust and interpretable.
📝 Abstract
Collaborative game-based learning environments offer rich opportunities for small-group knowledge construction, yet automatically predicting student collaboration satisfaction remains challenging. A critical barrier is modality degradation: in educational deployments, individual modalities such as eye gaze exhibit inconsistent informativeness across student cohorts, causing implicit attention-based fusion to produce brittle multimodal representations. We propose the Affinity-Aligned Multimodal Learning Analytics (AAMLA) framework, whose core contribution is the Cross-modal Affinity-guided Modality Alignment (CAMA) module, which explicitly models inter-modal relationships via affinity matrices and enforces cross-modal consistency through contrastive learning, enabling adaptive suppression of uninformative modalities without discarding them. AAMLA further applies modality-specific projection layers to map heterogeneous features, including facial action units, head pose, eye gaze, and interaction trace logs, into a unified semantic space prior to alignment. Experiments on 50 middle school students in the EcoJourneys collaborative learning environment demonstrate consistent improvements over unimodal baselines and prior cross-attention approaches under standard and modality degradation conditions, with SHAP and t-SNE analyses confirming that CAMA produces robust, interpretable cross-modal representations for student collaboration modeling.
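To make the pipeline described in the abstract more concrete, the following is a minimal NumPy sketch of its three ingredients: modality-specific projection into a shared space, a per-sample inter-modal affinity matrix, and a contrastive consistency term. It is an illustration under stated assumptions, not the authors' implementation. The feature dimensions, the random linear projections, the softmax over mean affinity used as fusion weights, and the InfoNCE form of the contrastive loss are all hypothetical choices made here, since the abstract does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def project(features, proj):
    """Map each modality into the shared space with its own linear layer."""
    return {name: l2_normalize(features[name] @ proj[name]) for name in features}

def affinity_matrix(embeddings):
    """Per-sample cosine affinity between every pair of modalities: (batch, M, M)."""
    stacked = np.stack(list(embeddings.values()), axis=1)  # (batch, M, d)
    return stacked @ stacked.transpose(0, 2, 1)

def affinity_weights(affinity, temperature=0.5):
    """Assumed weighting rule: softmax over each modality's mean affinity to the others."""
    m = affinity.shape[1]
    self_affinity = np.diagonal(affinity, axis1=1, axis2=2)  # (batch, M)
    score = (affinity.sum(axis=2) - self_affinity) / (m - 1)
    logits = score / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def fuse(embeddings, weights):
    """Weighted sum of modality embeddings; low weights suppress but never drop a modality."""
    stacked = np.stack(list(embeddings.values()), axis=1)
    return (weights[..., None] * stacked).sum(axis=1)

def info_nce(a, b, temperature=0.1):
    """InfoNCE loss: the same student's embeddings in two modalities form the positive pair."""
    logits = (a @ b.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Hypothetical feature sizes; the abstract names the modalities but not their dimensions.
dims = {"action_units": 17, "head_pose": 3, "eye_gaze": 4, "trace_logs": 10}
batch, shared_dim = 8, 16

features = {name: rng.normal(size=(batch, d)) for name, d in dims.items()}
proj = {name: rng.normal(size=(d, shared_dim)) for name, d in dims.items()}

embeddings = project(features, proj)
affinity = affinity_matrix(embeddings)
weights = affinity_weights(affinity)
fused = fuse(embeddings, weights)
loss = info_nce(embeddings["action_units"], embeddings["eye_gaze"])
```

In this sketch every modality keeps a strictly positive weight, which mirrors the stated goal of suppressing an uninformative modality without discarding it. In a trained model the projections would be learned jointly with the contrastive term rather than drawn at random.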