Multimodal Self-Attention Network with Temporal Alignment for Audio-Visual Emotion Recognition

📅 2026-03-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the temporal misalignment in audio-visual emotion recognition that arises when the audio and visual streams are sampled at different frame rates. The authors propose a Transformer-based multimodal self-attention network that jointly models intra- and inter-modal dependencies within a shared feature space. The approach introduces two components: Temporally-aligned Rotary Position Embeddings (TaRoPE), which implicitly synchronize audio and visual tokens sampled at heterogeneous rates, and a Cross-Temporal Matching (CTM) loss, which enforces consistency among temporally proximate audio-visual pairs. Experiments on the CREMA-D and RAVDESS datasets show consistent improvements over recent baselines, suggesting that explicitly accounting for frame-rate discrepancies helps preserve temporal cues and improves cross-modal fusion.

📝 Abstract

Audio-visual emotion recognition (AVER) methods typically fuse utterance-level features, and even frame-level attention models seldom address the frame-rate mismatch across modalities. In this paper, we propose a Transformer-based framework focusing on the temporal alignment of multimodal features. Our design employs a multimodal self-attention encoder that simultaneously captures intra- and inter-modal dependencies within a shared feature space. To address heterogeneous sampling rates, we incorporate Temporally-aligned Rotary Position Embeddings (TaRoPE), which implicitly synchronize audio and video tokens. Furthermore, we introduce a Cross-Temporal Matching (CTM) loss that enforces consistency among temporally proximate pairs, guiding the encoder toward better alignment. Experiments on CREMA-D and RAVDESS datasets demonstrate consistent improvements over recent baselines, suggesting that explicitly addressing frame-rate mismatch helps preserve temporal cues and enhances cross-modal fusion.
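The abstract does not give formulas for TaRoPE or the CTM loss, so the following is only an illustrative sketch of the two ideas, not the authors' implementation. It assumes that rotary angles are driven by each token's timestamp in seconds rather than by its index, so that audio and video tokens from the same instant receive the same rotation, and that the CTM loss can be approximated by a contrastive objective whose positives are audio-video pairs lying within a small time window. All function names, the frame rates (50 and 25 fps), `ticks_per_second`, `window_s`, and `temperature` are hypothetical choices made for this example.

```python
import numpy as np


def token_times(num_tokens, fps):
    """Timestamp (seconds) of each token, assuming uniform sampling from t = 0."""
    return np.arange(num_tokens) / float(fps)


def rotary_angles(times_s, dim, base=10000.0, ticks_per_second=100.0):
    """Rotation angles from timestamps instead of token indices (assumption).

    ticks_per_second is a hypothetical scale that maps seconds to position units.
    Returns an array of shape (num_tokens, dim // 2); dim must be even.
    """
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(times_s * ticks_per_second, inv_freq)


def apply_rotary(x, angles):
    """Rotate consecutive feature pairs of x (T, dim) by angles (T, dim // 2)."""
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


def ctm_loss_sketch(audio_emb, video_emb, t_audio, t_video,
                    window_s=0.03, temperature=0.1):
    """Contrastive stand-in for a cross-temporal matching loss (assumption).

    Audio-video pairs closer than window_s seconds are treated as positives;
    each audio token is pushed toward its positives among all video tokens.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature                       # (Ta, Tv) similarities
    positives = np.abs(t_audio[:, None] - t_video[None, :]) <= window_s

    # Numerically stable log-softmax over the video axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    has_pos = positives.any(axis=1)                      # skip audio tokens with no match
    pos_count = positives.sum(axis=1)[has_pos]
    per_token = -(log_prob * positives).sum(axis=1)[has_pos] / pos_count
    return per_token.mean()


# Hypothetical setup: 10 audio tokens at 50 fps, 5 video tokens at 25 fps.
rng = np.random.default_rng(0)
dim = 8
t_a, t_v = token_times(10, 50), token_times(5, 25)
audio = apply_rotary(rng.normal(size=(10, dim)), rotary_angles(t_a, dim))
video = apply_rotary(rng.normal(size=(5, dim)), rotary_angles(t_v, dim))
loss = ctm_loss_sketch(audio, video, t_a, t_v)
```

Under these assumptions, audio token 10 at 50 fps and video token 5 at 25 fps both sit at 0.2 s and therefore receive identical rotations, and the attention score between a rotated query and key depends only on their time difference, regardless of which modality each came from.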

Problem

Research questions and friction points this paper is trying to address.

audio-visual emotion recognition

frame-rate mismatch

temporal alignment

multimodal fusion

heterogeneous sampling rates

Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Alignment

Multimodal Self-Attention

TaRoPE