🤖 AI Summary
Existing video action recognition models suffer from imbalanced spatiotemporal modeling and superficial multimodal fusion, limiting holistic understanding. To address this, we propose CA²ST, a unified framework encompassing two paradigms: vision-only CAST and audio-visual dual-stream CAVA. Its core innovation is the Bottleneck Cross-Attention (B-CA) mechanism, which enables dynamic, layer-wise interaction among spatial, temporal, and audio experts within Transformer architectures. CA²ST integrates dual-stream spatiotemporal modeling, multi-expert collaborative prediction, and end-to-end audio-visual alignment learning. Evaluated on major visual benchmarks, including EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400, CA²ST achieves balanced, state-of-the-art performance. It also performs favorably on audio-visual benchmarks, including UCF-101, VGG-Sound, KineticsSound, and EPIC-SOUNDS.
📝 Abstract
We propose Cross-Attention in Audio, Space, and Time (CA²ST), a Transformer-based method for holistic video recognition. Recognizing actions in videos requires both spatial and temporal understanding, yet most existing models lack a balanced spatio-temporal understanding of videos. To address this, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), using only RGB input. In each layer of CAST, Bottleneck Cross-Attention (B-CA) enables spatial and temporal experts to exchange information and make synergistic predictions. For holistic video understanding, we extend CAST by integrating an audio expert, forming Cross-Attention in Visual and Audio (CAVA). We validate CAST on benchmarks with different characteristics, namely EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400, where it consistently shows balanced performance. We also validate CAVA on audio-visual action recognition benchmarks, including UCF-101, VGG-Sound, KineticsSound, and EPIC-SOUNDS. Given the favorable performance of CAVA across these datasets, we demonstrate effective information exchange among multiple experts within the B-CA module. In summary, CA²ST combines CAST and CAVA by employing spatial, temporal, and audio experts through cross-attention, achieving balanced and holistic video understanding.
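To make the B-CA idea concrete, the following is a minimal NumPy sketch of one plausible reading of the mechanism: an expert compresses its tokens into a few bottleneck tokens, which cross-attend to another expert's tokens, and the exchanged summary is added back residually. The projection, bottleneck size, and residual update here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Scaled dot-product cross-attention (keys and values shared)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

def bottleneck_cross_attention(expert_a, expert_b, num_bottleneck=4, seed=0):
    """Hypothetical B-CA sketch: expert A's tokens are compressed into a few
    bottleneck tokens, which query expert B's tokens via cross-attention;
    the exchanged summary is added back to expert A's tokens residually."""
    rng = np.random.default_rng(seed)
    n, d = expert_a.shape
    # Random projection stands in for a learned compression layer.
    proj = rng.standard_normal((n, num_bottleneck)) / np.sqrt(n)
    bottleneck = proj.T @ expert_a                     # (num_bottleneck, d)
    exchanged = cross_attention(bottleneck, expert_b)  # (num_bottleneck, d)
    # Residual update: broadcast the exchanged summary back to all tokens.
    return expert_a + exchanged.mean(axis=0)

spatial = np.random.default_rng(1).standard_normal((8, 16))   # spatial expert tokens
temporal = np.random.default_rng(2).standard_normal((8, 16))  # temporal expert tokens
fused = bottleneck_cross_attention(spatial, temporal)
print(fused.shape)  # (8, 16)
```

In CAST this exchange would run in both directions (spatial → temporal and temporal → spatial) in every layer, and CAVA would add a third, audio-expert stream to the same cross-attention scheme.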