🤖 AI Summary
Existing infrared and visible image fusion methods struggle to accommodate the heterogeneous preferences of human vision and machine vision at the same time, and they lack adaptive alignment capabilities. To address this, this work proposes DPOFusion, a framework that introduces Direct Preference Optimization (DPO) into image fusion for the first time. DPOFusion integrates a Property-Aligned Latent Diffusion Model (PALDM), which generates diverse candidate fusion results, with a Preference-Controllable Latent Diffusion Model (PCLDM), which is fine-tuned via instance direct preference optimization (IDPO) to enable task-guided, preference-adaptive fusion. The method aligns preferences from multiple sources (human observers, vision-language models, and downstream task networks) and reports state-of-the-art performance in preference alignment accuracy, fusion quality, and transferability to downstream tasks.
📝 Abstract
As a key technique in multi-modal processing, infrared and visible image fusion (IVIF) plays a crucial role in integrating complementary spectral information for visual enhancement and downstream vision tasks. Despite remarkable progress, existing methods struggle to flexibly accommodate heterogeneous demands. Achieving adaptive fusion that aligns with the various preferences of both human and machine vision remains an open and challenging problem. To address this challenge, we propose DPOFusion, a direct preference optimization (DPO) framework that integrates the property-aligned latent diffusion model (PALDM) and the preference-controllable latent diffusion model (PCLDM), enabling task-guided, preference-adaptive IVIF for both human and machine vision. PALDM leverages a latent fusion prior and a joint conditional loss to generate diverse candidate fusion results with various properties. PCLDM is subsequently fine-tuned via instance direct preference optimization (IDPO), enabling direct control of the final fusion results with heterogeneous preference signals. Experimental results demonstrate that our framework not only attains precise preference alignment with humans, vision-language models, and task-driven networks, but also sets a new benchmark for adaptive fusion quality and task-oriented transferability.
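For readers unfamiliar with DPO, the following is a minimal sketch of the generic pairwise DPO objective on which preference fine-tuning of this kind is typically built. It is not the paper's IDPO formulation, which the abstract does not specify. The function name `dpo_loss`, its arguments, and the default `beta` are illustrative assumptions; the inputs stand for log-likelihoods of a preferred and a rejected candidate under the model being tuned and under a frozen reference model.

```python
import math


def dpo_loss(logp_preferred, logp_rejected,
             ref_logp_preferred, ref_logp_rejected, beta=0.1):
    """Generic pairwise DPO loss for one preference pair (illustrative only).

    Assumption: each argument is a scalar log-likelihood. The first two come
    from the model being fine-tuned, the last two from a frozen reference.
    """
    # How much more the tuned model favours the preferred candidate over the
    # rejected one, measured relative to the reference model.
    margin = ((logp_preferred - ref_logp_preferred)
              - (logp_rejected - ref_logp_rejected))

    # loss = -log(sigmoid(beta * margin)) = softplus(-beta * margin),
    # computed in a numerically stable form.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Tuned model equals the reference: the margin is zero.
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Tuned model has shifted probability mass toward the preferred candidate.
    print(dpo_loss(-0.5, -2.0, -1.0, -1.0))
```

When the tuned model matches the reference, the loss equals log 2; it falls below that value as the model assigns relatively more likelihood to the preferred candidate, and rises above it when the rejected candidate is favoured.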