🤖 AI Summary
This work addresses ghosting and drift in infrared and visible video fusion. Ghosting stems from the geometric rigidity of optical-flow alignment, while drift stems from error accumulation when frame-by-frame diffusion models are extended to autoregressive settings. The authors reformulate fusion as history-conditioned motion generation and treat temporal consistency as spectral filtering, which aggregates motion dynamics implicitly rather than through explicit alignment. The main components are Stabilized History Guidance, Soft Temporal Anchoring, and a Decoupled Structure-Motion Adaptation strategy that relies on two-stage training and latent refinement. The authors report state-of-the-art fusion quality and temporal consistency, with ghosting artifacts and drift suppressed.
📝 Abstract
Infrared and visible video fusion is essential for achieving comprehensive perception in dynamic scenes. However, maintaining temporal consistency remains a formidable challenge. Conventional methods relying on optical flow often suffer from geometric rigidity and ghosting artifacts. Moreover, standard diffusion-based fusion models typically operate in a frame-by-frame manner; when extended to autoregressive settings, they lack intrinsic temporal constraints and are prone to severe error accumulation and drifting, where minor artifacts amplify over time. To address these limitations, we propose a drift-resilient video fusion method that reformulates the task as history-conditioned motion generation. We introduce Stabilized History Guidance and Soft Temporal Anchoring to reframe temporal consistency as spectral filtering, implicitly aggregating motion dynamics without rigid alignment. Furthermore, our Decoupled Structure-Motion Adaptation strategy bridges pre-trained priors and structural constraints via two-stage training and latent refinement. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both fusion quality and temporal stability.
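To give some intuition for what "temporal consistency as spectral filtering" can mean, the sketch below uses a first-order recursive average over past frames as a stand-in. This is not the paper's method: the abstract does not specify the filter, and the names `ema_filter`, `ema_gain`, and the blending weight `alpha` are assumptions made for illustration only. The sketch shows that softly blending each new value with an accumulated history behaves as a temporal low-pass filter: slow motion passes almost unchanged, while frame-to-frame flicker is strongly attenuated, all without any explicit alignment step.

```python
import math

def ema_filter(signal, alpha):
    """Illustrative recursive low-pass: y[t] = (1 - alpha) * y[t-1] + alpha * x[t].

    `alpha` is a hypothetical blending weight, not a parameter from the paper.
    """
    out = []
    state = signal[0]  # initialise the history with the first frame value
    for x in signal:
        state = (1.0 - alpha) * state + alpha * x
        out.append(state)
    return out

def ema_gain(omega, alpha):
    """Magnitude response of ema_filter at angular frequency omega (rad/frame)."""
    a = 1.0 - alpha
    return alpha / math.sqrt(1.0 - 2.0 * a * math.cos(omega) + a * a)

# Toy per-pixel intensity over 400 frames (assumed data, for illustration).
n = 400
flicker = [0.5 * (-1) ** t for t in range(n)]                 # alternates every frame
slow = [math.sin(2.0 * math.pi * t / 200.0) for t in range(n)]  # slow scene motion

alpha = 0.2
filtered_flicker = ema_filter(flicker, alpha)
filtered_slow = ema_filter(slow, alpha)

# Gain at the two frequencies present in the toy signal.
gain_flicker = ema_gain(math.pi, alpha)               # highest temporal frequency
gain_slow = ema_gain(2.0 * math.pi / 200.0, alpha)    # slow motion component

print(f"gain at flicker frequency: {gain_flicker:.3f}")
print(f"gain at slow-motion frequency: {gain_slow:.3f}")
```

A smaller `alpha` leans harder on history and suppresses more flicker, at the cost of lagging behind genuine motion. That trade-off is one plausible reading of why the abstract describes the temporal anchoring as soft rather than rigid.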