FTPFusion: Frequency-Aware Infrared and Visible Video Fusion with Temporal Perturbation

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of simultaneously preserving spatiotemporal consistency and high-frequency details in infrared and visible video fusion. To this end, the authors propose a frequency-aware fusion framework that decomposes features into high- and low-frequency components for separate modeling. The high-frequency branch captures motion cues and fine details through sparse cross-modal spatiotemporal interactions, while the low-frequency branch enhances robustness to dynamic artifacts such as flickering and jitter via a temporal perturbation strategy. Additionally, an offset-aware temporal consistency constraint is introduced to stabilize inter-frame representations. By jointly integrating frequency decomposition, sparse cross-modal interaction, and temporal perturbation—a combination not previously explored—this method achieves state-of-the-art performance on multiple public benchmarks, significantly improving temporal stability without compromising high-frequency detail preservation.
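The paper does not publish implementation details beyond the summary above, but the core idea of splitting features into low- and high-frequency components can be illustrated with a minimal sketch. Here the low-frequency part is obtained with a simple box low-pass filter and the high-frequency part is the residual, so the two components sum back to the original feature map; the filter choice and function names are illustrative assumptions, not the authors' architecture.

```python
import numpy as np

def box_lowpass(feat: np.ndarray, k: int = 3) -> np.ndarray:
    """Smooth a 2-D feature map with a k x k box filter (edge-padded)."""
    pad = k // 2
    padded = np.pad(feat, pad, mode="edge")
    h, w = feat.shape
    out = np.empty_like(feat, dtype=float)
    for i in range(h):
        for j in range(w):
            # Average over the k x k neighbourhood around (i, j).
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def frequency_decompose(feat: np.ndarray, k: int = 3):
    """Split a feature map into a low-frequency (smooth) component and a
    high-frequency (residual detail) component, with feat = low + high."""
    low = box_lowpass(feat, k)
    high = feat - low
    return low, high
```

Because the decomposition is a strict sum, the two branches can be modeled separately and recombined without information loss.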
📝 Abstract
Infrared and visible video fusion plays a critical role in intelligent surveillance and low-light monitoring. However, maintaining temporal stability while preserving spatial detail remains a fundamental challenge. Existing methods either focus on frame-wise enhancement with limited temporal modeling or rely on heavy spatio-temporal aggregation that often sacrifices high-frequency details. In this paper, we propose FTPFusion, a frequency-aware infrared and visible video fusion method based on temporal perturbation and sparse cross-modal interaction. Specifically, FTPFusion decomposes the feature representations into high-frequency and low-frequency components for collaborative modeling. The high-frequency branch performs sparse cross-modal spatio-temporal interaction to capture motion-related context and complementary details. The low-frequency branch introduces a temporal perturbation strategy to enhance robustness against complex video variations, such as flickering, jitter, and local misalignment. Furthermore, we design an offset-aware temporal consistency constraint to explicitly stabilize cross-frame representations under temporal disturbances. Extensive experiments on multiple public benchmarks demonstrate that FTPFusion consistently outperforms state-of-the-art methods across multiple metrics in both spatial fidelity and temporal consistency. The source code will be available at https://github.com/ixilai/FTPFusion.
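The abstract's temporal perturbation strategy and offset-aware consistency constraint can be sketched in simplified form: perturb a frame with a known random translation (a stand-in for jitter), then penalize the distance between the clean prediction and the perturbed prediction after undoing that known offset. This is a hypothetical illustration of the general principle, assuming integer wrap-around shifts; the actual perturbations and loss in FTPFusion are not specified at this level of detail.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_frame(frame: np.ndarray, max_shift: int = 2):
    """Apply a random integer translation (simulating camera jitter) and
    return the perturbed frame together with the offset that was used."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(frame, shift=(dy, dx), axis=(0, 1))
    return shifted, (int(dy), int(dx))

def offset_aware_consistency(pred: np.ndarray,
                             pred_perturbed: np.ndarray,
                             offset) -> float:
    """L2 penalty between the prediction on the clean frame and the
    prediction on the perturbed frame after undoing the known offset."""
    dy, dx = offset
    realigned = np.roll(pred_perturbed, shift=(-dy, -dx), axis=(0, 1))
    return float(np.mean((pred - realigned) ** 2))
```

For a perfectly shift-equivariant model the realigned prediction matches the clean one and the penalty is zero; any residual measures temporal instability under the perturbation.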
Problem

Research questions and friction points this paper is trying to address.

infrared and visible video fusion, temporal stability, spatial detail preservation, high-frequency details, temporal perturbation
Innovation

Methods, ideas, or system contributions that make the work stand out.

frequency-aware modeling, temporal perturbation, sparse cross-modal interaction, temporal consistency, video fusion
Authors
Xilai Li, Foshan University
Chusheng Fang, Foshan University
Xiaosong Li, Foshan University
Image fusion, computer vision, pattern recognition