Multimodal Self-Attention Network with Temporal Alignment for Audio-Visual Emotion Recognition

📅 2026-03-11
🤖 AI Summary
This work addresses temporal misalignment in audio-visual emotion recognition caused by the different frame rates of the audio and visual streams. The authors propose a Transformer-based multimodal self-attention network that jointly models intra- and inter-modal dependencies within a shared feature space. The approach introduces two components: Temporally-aligned Rotary Position Embeddings (TaRoPE), which implicitly synchronize audio and video tokens sampled at heterogeneous rates, and a Cross-Temporal Matching (CTM) loss, which enforces consistency among temporally proximate pairs. Experiments on the CREMA-D and RAVDESS datasets show consistent improvements over recent baselines, suggesting that explicitly accounting for frame-rate mismatch helps preserve temporal cues and improves cross-modal fusion.
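One way to picture how a rotary position encoding could synchronize tokens sampled at different rates is to rotate each token by its timestamp in seconds rather than by its index. The sketch below is illustrative only and is not taken from the paper: the function names, the 50 Hz and 25 Hz token rates, and the frequency base are all assumptions.

```python
import math


def rotate_by_time(vec, t_seconds, base=10000.0):
    """Rotate consecutive dimension pairs of `vec` by angles proportional to a timestamp.

    Assumption: the position is a real-valued time in seconds, not a token index,
    so streams with different frame rates share one time axis.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("rotary encodings need an even number of dimensions")
    out = []
    for i in range(0, d, 2):
        freq = base ** (-i / d)  # usual rotary frequency schedule
        angle = t_seconds * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for one attention logit."""
    return sum(x * y for x, y in zip(a, b))


def token_times(num_tokens, rate_hz):
    """Timestamp in seconds of each token for a stream sampled at `rate_hz`."""
    return [k / rate_hz for k in range(num_tokens)]


# Hypothetical rates: 50 audio tokens per second, 25 video tokens per second.
audio_t = token_times(100, 50.0)
video_t = token_times(50, 25.0)

q = [1.0, 0.5, -0.3, 2.0]   # toy audio query
k = [0.2, -1.0, 0.7, 0.4]   # toy video key

# Audio token 50 and video token 25 have different indices but the same timestamp.
score_same_instant = dot(rotate_by_time(q, audio_t[50]), rotate_by_time(k, video_t[25]))
```

Because a rotary attention score depends only on the difference between the two positions, an audio token and a video token that fall at the same instant interact as if they shared a position, whatever their indices are.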

📝 Abstract
Audio-visual emotion recognition (AVER) methods typically fuse utterance-level features, and even frame-level attention models seldom address the frame-rate mismatch across modalities. In this paper, we propose a Transformer-based framework focusing on the temporal alignment of multimodal features. Our design employs a multimodal self-attention encoder that simultaneously captures intra- and inter-modal dependencies within a shared feature space. To address heterogeneous sampling rates, we incorporate Temporally-aligned Rotary Position Embeddings (TaRoPE), which implicitly synchronize audio and video tokens. Furthermore, we introduce a Cross-Temporal Matching (CTM) loss that enforces consistency among temporally proximate pairs, guiding the encoder toward better alignment. Experiments on the CREMA-D and RAVDESS datasets demonstrate consistent improvements over recent baselines, suggesting that explicitly addressing frame-rate mismatch helps preserve temporal cues and enhances cross-modal fusion.
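The abstract does not give the formula for the CTM loss, so the following is only a guess at what "consistency among temporally proximate pairs" could look like in code. It assumes a soft-target cross-entropy: each audio token's similarity distribution over video tokens is pulled toward a Gaussian weighting of their time differences. The function names, the Gaussian kernel, `sigma`, and `temperature` are all hypothetical choices, not the authors' definitions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def temporal_targets(t_query, t_keys, sigma):
    """Soft targets that favor keys close in time to the query (assumed Gaussian kernel)."""
    return softmax([-((t_query - t) ** 2) / (2.0 * sigma ** 2) for t in t_keys])


def matching_loss(audio, audio_t, video, video_t, temperature=0.1, sigma=0.02):
    """Mean cross-entropy between time-proximity targets and audio-to-video similarities."""
    total = 0.0
    for a, ta in zip(audio, audio_t):
        sims = [sum(x * y for x, y in zip(a, v)) / temperature for v in video]
        probs = softmax(sims)
        targets = temporal_targets(ta, video_t, sigma)
        total -= sum(w * math.log(p + 1e-12) for w, p in zip(targets, probs))
    return total / len(audio)


# Toy data: four audio tokens at 50 Hz, two video tokens at 25 Hz (hypothetical rates).
audio_t = [0.00, 0.02, 0.04, 0.06]
video_t = [0.00, 0.04]
video = [[1.0, 0.0], [0.0, 1.0]]

# Audio features that agree with the video token nearest in time...
aligned_audio = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# ...and the same features in reversed temporal order.
reversed_audio = list(reversed(aligned_audio))

loss_aligned = matching_loss(aligned_audio, audio_t, video, video_t)
loss_reversed = matching_loss(reversed_audio, audio_t, video, video_t)
```

In this toy setup the loss is lower when each audio feature resembles the video feature nearest to it in time, which is the kind of pressure such a term would add alongside the emotion classification objective.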
Problem

Research questions and friction points this paper is trying to address.

audio-visual emotion recognition
frame-rate mismatch
temporal alignment
multimodal fusion
heterogeneous sampling rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Alignment
Multimodal Self-Attention
TaRoPE
Cross-Temporal Matching
Audio-Visual Emotion Recognition