DTFSal: Audio-Visual Dynamic Token Fusion for Video Saliency Prediction

📅 2025-04-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Audio-visual saliency prediction jointly models auditory and visual cues to approximate human attention, but faces two key challenges: efficient integration of auditory features and modeling of long-range spatiotemporal dependencies. To address these, the authors propose DTFSal, which pairs a multi-scale visual encoder, enhanced by a Learnable Token Enhancement Block (LTEB) and a Dynamic Learnable Token Fusion Block (DLTFB), with a raw-waveform audio branch. Visual and auditory features are combined by an Adaptive Multimodal Fusion Block (AMFB) that runs local, global, and adaptive fusion streams, and a hierarchical multi-decoder produces the final saliency maps. Evaluated on six audio-visual saliency benchmarks, the method achieves state-of-the-art accuracy while remaining computationally efficient.

📝 Abstract
Audio-visual saliency prediction aims to mimic human visual attention by identifying salient regions in videos through the integration of both visual and auditory information. Although visual-only approaches have advanced significantly, effectively incorporating auditory cues remains challenging due to complex spatio-temporal interactions and high computational demands. To address these challenges, we propose Dynamic Token Fusion Saliency (DTFSal), a novel audio-visual saliency prediction framework designed to balance accuracy with computational efficiency. Our approach features a multi-scale visual encoder equipped with two novel modules: the Learnable Token Enhancement Block (LTEB), which adaptively weights tokens to emphasize crucial saliency cues, and the Dynamic Learnable Token Fusion Block (DLTFB), which employs a shifting operation to reorganize and merge features, effectively capturing long-range dependencies and detailed spatial information. In parallel, an audio branch processes raw audio signals to extract meaningful auditory features. Both visual and audio features are integrated using our Adaptive Multimodal Fusion Block (AMFB), which employs local, global, and adaptive fusion streams for precise cross-modal fusion. The resulting fused features are processed by a hierarchical multi-decoder structure, producing accurate saliency maps. Extensive evaluations on six audio-visual benchmarks demonstrate that DTFSal achieves SOTA performance while maintaining computational efficiency.
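
The paper's code is not reproduced on this page; the following is a minimal PyTorch sketch of how the two token modules described in the abstract could plausibly operate. The module names (LTEB, DLTFB) come from the paper, but every internal design choice here, such as the gating MLP and the channel-group roll, is an assumption made for illustration, not the authors' implementation.

```python
# Hedged sketch of the abstract's two token modules. LTEB weights tokens
# adaptively; DLTFB shifts and merges features for long-range context.
import torch
import torch.nn as nn


class LTEB(nn.Module):
    """Learnable Token Enhancement Block: per-token gating (assumed design)."""

    def __init__(self, dim: int):
        super().__init__()
        # Small MLP scores each token; a sigmoid turns scores into weights.
        self.gate = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                  nn.Linear(dim // 4, 1), nn.Sigmoid())

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        return tokens * self.gate(tokens)  # emphasize high-scoring tokens


class DLTFB(nn.Module):
    """Dynamic Learnable Token Fusion Block: shift-and-merge (assumed design)."""

    def __init__(self, dim: int, groups: int = 4):
        super().__init__()
        self.groups = groups
        self.merge = nn.Linear(dim, dim)  # learnable fusion after shifting

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Roll each channel group by a different offset along the token axis,
        # so every position mixes information from distant neighbors.
        chunks = tokens.chunk(self.groups, dim=-1)
        shifted = [torch.roll(c, shifts=i, dims=1) for i, c in enumerate(chunks)]
        return self.merge(torch.cat(shifted, dim=-1)) + tokens  # residual


# Quick shape check on dummy tokens.
x = torch.randn(2, 196, 256)  # (batch, tokens, dim)
x = LTEB(256)(x)
x = DLTFB(256)(x)
print(x.shape)                # torch.Size([2, 196, 256])
```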
Problem

Research questions and friction points this paper is trying to address.

Integrate auditory cues into video saliency prediction effectively
Balance accuracy and computational efficiency in saliency prediction
Capture long-range dependencies and spatial details in audio-visual fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Token Fusion balances accuracy and efficiency
Learnable Token Enhancement emphasizes crucial cues
Adaptive Multimodal Fusion integrates audio-visual features (see the sketch after this list)
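
To make the fusion bullet above concrete, here is a hedged PyTorch sketch of an AMFB-style block with the three streams named in the abstract. The specific operators (a 1D convolution for the local stream, a pooled MLP for the global stream, and a learned per-token softmax for the adaptive mix) are illustrative assumptions rather than the authors' exact design.

```python
# Hedged sketch of the AMFB's local, global, and adaptive fusion streams.
import torch
import torch.nn as nn


class AMFB(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.local = nn.Conv1d(2 * dim, dim, kernel_size=3, padding=1)  # local stream
        self.global_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                        nn.Linear(dim, dim))            # global stream
        self.mix = nn.Linear(2 * dim, 2)                                # adaptive weights

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis, aud: (batch, num_tokens, dim); audio is assumed already
        # aligned to the visual token grid.
        cat = torch.cat([vis, aud], dim=-1)                   # (B, N, 2*dim)
        local = self.local(cat.transpose(1, 2)).transpose(1, 2)
        glob = self.global_mlp(cat.mean(dim=1, keepdim=True)).expand_as(local)
        w = torch.softmax(self.mix(cat), dim=-1)              # per-token stream weights
        return w[..., :1] * local + w[..., 1:] * glob         # adaptive fusion


fused = AMFB(256)(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
print(fused.shape)  # torch.Size([2, 196, 256])
```

The per-token softmax lets the block lean on local context where audio and video agree spatially, and fall back to the global summary elsewhere; this is one plausible reading of "adaptive fusion," not a claim about the paper's mechanism.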
👥 Authors
Kiana Hoshanfar (University of Tehran)
Alireza Hosseini (University of Tehran)
Ahmad Kalhor (University of Tehran)
B. N. Araabi (University of Tehran)