🤖 AI Summary
Existing video multimodal fusion methods naively adapt static image fusion techniques, neglecting temporal dependencies and resulting in inter-frame inconsistency. To address this, we propose the first temporal modeling and vision–semantics co-learning framework specifically designed for video fusion. Our method introduces a vision–semantics interaction module and a temporal coordination module, implemented via dual distillation branches—DINOv2 for semantic representation and VGG19 for low-level visual features. We further incorporate a temporal enhancement mechanism, a temporal consistency loss, and a dedicated evaluation metric suite. Extensive experiments demonstrate that our approach significantly improves weak-information recovery and dynamic coherence, achieving state-of-the-art performance across multiple public video datasets. It attains superior results in all three critical dimensions: visual fidelity, semantic accuracy, and temporal continuity. The source code is publicly available.
📝 Abstract
Existing multi-modal fusion methods typically apply static frame-based image fusion techniques directly to video fusion tasks, neglecting inherent temporal dependencies and producing inconsistent results across frames. To address this limitation, we propose the first video fusion framework that explicitly incorporates temporal modeling and visual-semantic collaboration to simultaneously ensure visual fidelity, semantic accuracy, and temporal consistency. First, we introduce a visual-semantic interaction module consisting of a semantic branch and a visual branch, with DINOv2 and VGG19 employed for targeted distillation, allowing simultaneous enhancement of both visual and semantic representations. Second, we are the first to integrate the video degradation enhancement task into the video fusion pipeline, constructing a temporal cooperative module that leverages temporal dependencies to facilitate weak-information recovery. Third, to ensure temporal consistency, we embed a temporal enhancement mechanism into the network and devise a temporal loss to guide the optimization process. Finally, we introduce two innovative evaluation metrics tailored to video fusion, aimed at assessing the temporal consistency of the generated fused videos. Extensive experiments on public video datasets demonstrate the superiority of our method. Our code is released at https://github.com/Meiqi-Gong/TemCoCo.
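The abstract does not specify the form of the temporal loss. A common way to penalize inter-frame inconsistency (a generic sketch under assumed conventions, not the paper's exact formulation) is to warp the previous fused frame toward the current one with optical flow and measure the residual:

```python
import numpy as np

def warp(frame, flow):
    """Backward-warp a grayscale frame (H, W) by a dense flow field (H, W, 2).

    flow[..., 0] / flow[..., 1] are the x / y displacements that map each
    target pixel to its source location; nearest-neighbor sampling is used
    here for simplicity (real implementations use bilinear sampling).
    """
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return frame[src_y, src_x]

def temporal_consistency_loss(fused_t, fused_prev, flow):
    """Mean absolute difference between the current fused frame and the
    flow-warped previous fused frame. A small value means the fused video
    changes smoothly along the motion trajectories.
    """
    return float(np.mean(np.abs(fused_t - warp(fused_prev, flow))))
```

With zero flow and identical consecutive frames the loss is exactly zero; any flicker introduced by per-frame fusion shows up directly in this residual, which is why such warping losses are a standard proxy for temporal coherence.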