Efficient Rectified Flow for Image Fusion

📅 2025-09-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Image fusion requires integrating complementary multimodal information, yet existing diffusion models suffer from prohibitively long inference times and high computational overhead. This paper proposes RFfusion, an efficient single-step diffusion framework built on Rectified Flow. Its core innovations are: (i) single-step sampling without additional training; (ii) a task-specific variational autoencoder (VAE) that performs cross-modal feature alignment and fusion directly in the latent space; and (iii) a two-stage training strategy that jointly optimizes reconstruction fidelity and semantic consistency of the fused representations. Evaluated on multiple benchmarks, RFfusion achieves over 100× speedup relative to typical diffusion-based methods while outperforming state-of-the-art approaches in fusion quality, delivering millisecond-level inference latency without sacrificing fine-grained structural detail. The authors position it as the first diffusion-based method to unify efficiency and high-fidelity detail preservation.

๐Ÿ“ Abstract
Image fusion is a fundamental and important task in computer vision, aiming to combine complementary information from different modalities into a single fused image. In recent years, diffusion models have driven significant progress in the field of image fusion. However, diffusion models often require heavy computation and long inference times, which limits their practical applicability. To address this issue, we propose RFfusion, an efficient one-step diffusion model for image fusion based on Rectified Flow. We incorporate Rectified Flow into the image fusion task to straighten the sampling path in the diffusion model, achieving one-step sampling without the need for additional training, while still maintaining high-quality fusion results. Furthermore, we propose a task-specific variational autoencoder (VAE) architecture tailored for image fusion, where the fusion operation is embedded within the latent space to further reduce computational complexity. To address the inherent discrepancy between conventional reconstruction-oriented VAE objectives and the requirements of image fusion, we introduce a two-stage training strategy. This approach facilitates the effective learning and integration of complementary information from multi-modal source images, thereby enabling the model to retain fine-grained structural details while significantly enhancing inference efficiency. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods in terms of both inference speed and fusion quality. Code is available at https://github.com/zirui0625/RFfusion.
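The pipeline the abstract describes (encode both modalities, fuse in the latent space, take a single rectified-flow sampling step, decode) can be sketched with toy linear stand-ins for the learned networks. Everything here is a hypothetical illustration, not the paper's architecture: the weight matrices, the averaging fusion, and the conditioning scheme are placeholders; the one point it demonstrates is that a straightened flow lets a single Euler step cover the whole interval from noise (t=0) to the sample (t=1).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # latent dimension (toy size)

# Hypothetical stand-ins for the learned networks: a VAE encoder/decoder
# pair and a conditional velocity field. Real versions are deep nets.
W_enc = rng.normal(size=(D, D)) * 0.1
W_dec = rng.normal(size=(D, D)) * 0.1
W_vel = rng.normal(size=(2 * D, D)) * 0.1

def encode(x):
    return x @ W_enc          # project a (flattened) image into latent space

def decode(z):
    return z @ W_dec          # map a latent back to image space

def fuse(z_a, z_b):
    # Latent-space fusion: a plain average here; RFfusion learns this
    # operation inside its task-specific VAE.
    return 0.5 * (z_a + z_b)

def velocity(z_t, z_cond):
    # Rectified-flow velocity field v(z_t, t | z_cond); time conditioning
    # is omitted in this sketch since only one step is taken.
    feat = np.concatenate([z_t, z_cond])
    return feat @ W_vel

# Two source modalities (e.g. infrared and visible), encoded and fused.
z_ir = encode(rng.normal(size=D))
z_vis = encode(rng.normal(size=D))
z_cond = fuse(z_ir, z_vis)

# One-step sampling: because rectified flow straightens the probability
# path, a single Euler step with dt = 1 spans the full trajectory.
z0 = rng.normal(size=D)                  # Gaussian noise at t = 0
z1 = z0 + 1.0 * velocity(z0, z_cond)     # single Euler step to t = 1
fused_image = decode(z1)
print(fused_image.shape)
```

A multi-step diffusion sampler would replace the single Euler update with tens or hundreds of such steps, which is exactly the cost the straightened path removes.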
Problem

Research questions and friction points this paper is trying to address.

Diffusion models for image fusion require complex computations and slow inference
Conventional VAEs have objectives misaligned with image fusion requirements
Need to maintain fusion quality while significantly improving inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rectified Flow for one-step diffusion sampling
Task-specific VAE with latent space fusion
Two-stage training strategy for fusion optimization
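The two-stage strategy listed above can be illustrated with a minimal, dependency-free sketch: stage 1 trains a toy linear autoencoder for pure reconstruction, then stage 2 fine-tunes it against a fusion-oriented target. The elementwise-max target, the linear model, and the finite-difference optimizer are all assumptions made for brevity; the paper's actual losses and networks differ.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8

# Hypothetical tiny linear VAE stand-in: one matrix each for encode/decode.
params = {"enc": rng.normal(size=(D, D)) * 0.1,
          "dec": rng.normal(size=(D, D)) * 0.1}

x_a, x_b = rng.normal(size=D), rng.normal(size=D)  # two source modalities

def recon_loss(p):
    # Stage 1 objective: plain reconstruction of each modality.
    za, zb = x_a @ p["enc"], x_b @ p["enc"]
    return np.mean((za @ p["dec"] - x_a) ** 2 + (zb @ p["dec"] - x_b) ** 2)

def fusion_loss(p):
    # Stage 2 objective (hypothetical proxy): the decoded fused latent
    # should retain salient content of both inputs (elementwise max).
    z = 0.5 * (x_a @ p["enc"] + x_b @ p["enc"])
    return np.mean((z @ p["dec"] - np.maximum(x_a, x_b)) ** 2)

def sgd_step(p, loss_fn, lr=0.02, eps=1e-4):
    # Central finite-difference gradients keep the sketch dependency-free.
    for k in p:
        g = np.zeros_like(p[k])
        for idx in np.ndindex(p[k].shape):
            p[k][idx] += eps; hi = loss_fn(p)
            p[k][idx] -= 2 * eps; lo = loss_fn(p)
            p[k][idx] += eps
            g[idx] = (hi - lo) / (2 * eps)
        p[k] -= lr * g
    return p

for _ in range(300):                     # stage 1: reconstruction pretraining
    params = sgd_step(params, recon_loss)

fl_before_stage2 = fusion_loss(params)
for _ in range(300):                     # stage 2: fusion-oriented fine-tuning
    params = sgd_step(params, fusion_loss)

print(fusion_loss(params))
```

The point of the split is visible even in this toy: the reconstruction-only optimum of stage 1 is not the fusion optimum, and stage 2 moves the same parameters toward the fusion objective without retraining from scratch.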
👥 Authors
Zirui Wang (City University of Hong Kong)
Jiayi Zhang (Dalian University of Technology)
Tianwei Guan (Chinese University of Hong Kong)
Yuhan Zhou (Ph.D. student, University of North Texas)
Xingyuan Li (Zhejiang University)
Minjing Dong (Assistant Professor of Computer Science, City University of Hong Kong)
Jinyuan Liu (Dalian University of Technology)