Multimodal Self-Attention Network with Temporal Alignment for Audio-Visual Emotion Recognition

📅 2026-03-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the temporal misalignment problem in audio-visual emotion recognition caused by inconsistent frame rates across modalities. The authors propose a Transformer-based multimodal self-attention network that jointly models intra- and inter-modal dependencies within a shared feature space. The approach introduces two key components: Temporally-aligned Rotary Position Embeddings (TaRoPE) and a Cross-Temporal Matching (CTM) loss, which together implicitly synchronize audio and visual tokens sampled at heterogeneous rates and improve temporal consistency. Experiments on the CREMA-D and RAVDESS datasets show consistent gains over recent baselines, indicating that explicitly accounting for frame-rate discrepancies helps preserve critical temporal cues and improves emotion recognition performance.

📝 Abstract
Audio-visual emotion recognition (AVER) methods typically fuse utterance-level features, and even frame-level attention models seldom address the frame-rate mismatch across modalities. In this paper, we propose a Transformer-based framework focusing on the temporal alignment of multimodal features. Our design employs a multimodal self-attention encoder that simultaneously captures intra- and inter-modal dependencies within a shared feature space. To address heterogeneous sampling rates, we incorporate Temporally-aligned Rotary Position Embeddings (TaRoPE), which implicitly synchronize audio and video tokens. Furthermore, we introduce a Cross-Temporal Matching (CTM) loss that enforces consistency among temporally proximate pairs, guiding the encoder toward better alignment. Experiments on CREMA-D and RAVDESS datasets demonstrate consistent improvements over recent baselines, suggesting that explicitly addressing frame-rate mismatch helps preserve temporal cues and enhances cross-modal fusion.
Problem

Research questions and friction points this paper is trying to address.

audio-visual emotion recognition
frame-rate mismatch
temporal alignment
multimodal fusion
heterogeneous sampling rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Alignment
Multimodal Self-Attention
TaRoPE
Cross-Temporal Matching
Audio-Visual Emotion Recognition
Inyong Koo
School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
Yeeun Seong
Graduate School of Green Growth and Sustainability, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
Minseok Son
Korea Advanced Institute of Science and Technology
Long-tailed Recognition
Few-shot Learning
Semantic Segmentation
Jaehyuk Jang
School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
Changick Kim
Korea Advanced Institute of Science and Technology
Computer vision