VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion and Restoration

📅 2025-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of large-scale benchmarks and the difficulty of spatiotemporal joint modeling in multi-sensor video fusion, this paper introduces M3SVD—the first large-scale, synchronized, and precisely registered infrared–visible light video dataset—and proposes an end-to-end collaborative network. Methodologically, it pioneers a differential reinforcement module to enhance cross-modal discriminative representation; designs a modality-guided fusion strategy for selective aggregation of complementary information; and incorporates a bidirectional temporal co-attention mechanism to jointly model spatiotemporal dependencies. The network achieves deep integration of cross-modal feature interaction, multi-scale fusion, and video-level spatiotemporal optimization. Extensive experiments on M3SVD demonstrate that our approach significantly outperforms image-level fusion methods, effectively mitigating inter-frame inconsistency and cross-modal interference while improving structural fidelity and motion coherence. This work establishes a new paradigm for multimodal video restoration.

📝 Abstract
Compared to images, videos better align with real-world acquisition scenarios and possess valuable temporal cues. However, existing multi-sensor fusion research predominantly integrates complementary context from multiple images rather than videos. This primarily stems from two factors: 1) the scarcity of large-scale multi-sensor video datasets, which limits research in video fusion, and 2) the inherent difficulty of jointly modeling spatial and temporal dependencies in a unified framework. This paper addresses both dilemmas. First, we construct M3SVD, a benchmark dataset with 220 temporally synchronized and spatially registered infrared-visible video pairs comprising 153,797 frames, filling the data gap for the video fusion community. Second, we propose VideoFusion, a multi-modal video fusion model that fully exploits cross-modal complementarity and temporal dynamics to generate spatio-temporally coherent videos from (potentially degraded) multi-modal inputs. Specifically, 1) a differential reinforcement module is developed for cross-modal information interaction and enhancement, 2) a complete modality-guided fusion strategy is employed to adaptively integrate multi-modal features, and 3) a bi-temporal co-attention mechanism is devised to dynamically aggregate forward-backward temporal contexts to reinforce cross-frame feature representations. Extensive experiments reveal that VideoFusion outperforms existing image-oriented fusion paradigms in sequential scenarios, effectively mitigating temporal inconsistency and interference.
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale multi-sensor video datasets for fusion research
Difficulty in modeling spatial-temporal dependencies in video fusion
Need for coherent multi-modal video fusion from degraded inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs M3SVD benchmark dataset for video fusion
Proposes VideoFusion model for multi-modal video fusion
Uses bi-temporal co-attention for temporal context aggregation
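The bi-temporal co-attention mechanism is described here only at a high level: each frame's features are reinforced by attending over both forward and backward temporal contexts. A minimal NumPy sketch of that general idea follows; the function name, the causal/anti-causal masking, and the simple averaging of the two contexts are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bi_temporal_co_attention(frames):
    """Hypothetical sketch: frames is a (T, D) array of per-frame
    features. Each frame attends to preceding frames (forward context)
    and to following frames (backward context); both aggregated
    contexts are averaged with the original feature."""
    T, D = frames.shape
    scores = frames @ frames.T / np.sqrt(D)          # (T, T) similarities
    fwd_mask = np.tril(np.ones((T, T), dtype=bool))  # frame t sees frames <= t
    bwd_mask = np.triu(np.ones((T, T), dtype=bool))  # frame t sees frames >= t
    neg = np.full_like(scores, -1e9)                 # mask out with large negative
    fwd_ctx = softmax(np.where(fwd_mask, scores, neg)) @ frames
    bwd_ctx = softmax(np.where(bwd_mask, scores, neg)) @ frames
    return (frames + fwd_ctx + bwd_ctx) / 3.0
```

In the actual model the attention weights would be learned (separate query/key/value projections per direction) and applied to convolutional feature maps rather than flat vectors; the sketch only illustrates the forward-backward aggregation pattern.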
Authors
Linfeng Tang, Wuhan University
Yeda Wang, Wuhan University
Meiqi Gong, Wuhan University (Image Processing)
Zizhuo Li, Wuhan University (Computer Vision, Image Matching, Multi-View Geometry)
Yuxin Deng, Wuhan University
Xunpeng Yi, Wuhan University (Computer Vision)
Chunyu Li, Wuhan University
Han Xu, Southeast University
Hao Zhang, Wuhan University
Jiayi Ma, Wuhan University (Computer Vision, Image Fusion, Image Matching)