🤖 AI Summary
To address modality bias induced by illumination variations in infrared and visible image fusion, this paper proposes MoCTEFuse, a dynamic multi-level fusion network. Methodologically, it introduces an illumination-gated Mixture-of-Experts (MoE) architecture and a dual-branch Chiral Transformer fusion block, which together enable modality-adaptive switching and weight allocation via asymmetric cross-attention. It further incorporates multi-level feature aggregation and a novel illumination-distribution-aware multi-level competitive loss function. Extensive experiments on the DroneVehicle, MSRS, TNO, and RoadScene benchmarks demonstrate substantial improvements in fusion quality. In downstream object detection, MoCTEFuse achieves 70.93% mAP on MFNet and 45.14% mAP on DroneVehicle, significant gains over state-of-the-art methods. The core contributions are (i) the illumination-aware dynamic gating mechanism for modality-specific feature modulation, (ii) the Chiral Transformer with asymmetric cross-attention for robust cross-modal interaction, and (iii) the hierarchical competitive loss that explicitly models illumination distribution across scales.
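The illumination-gated routing described above can be sketched as a soft mixture of two expert subnetworks, where a scalar illumination signal decides how much each expert contributes. This is a minimal illustrative sketch, not the paper's implementation: the expert functions, the `illum_score` signal, and all names here are hypothetical stand-ins.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def illumination_gated_moe(ir_feat, vis_feat, illum_score, high_expert, low_expert):
    """Softly route features through high-/low-illumination experts.

    illum_score in [0, 1] is a scalar gating signal (e.g. a predicted
    day/night probability); the gate mixes the two expert outputs.
    """
    gate = softmax(np.array([illum_score, 1.0 - illum_score]))  # [w_high, w_low]
    fused_high = high_expert(ir_feat, vis_feat)
    fused_low = low_expert(ir_feat, vis_feat)
    return gate[0] * fused_high + gate[1] * fused_low

# Toy experts: each trusts one modality more (illustrative weights only).
high_expert = lambda ir, vis: 0.3 * ir + 0.7 * vis  # daytime: favour visible
low_expert = lambda ir, vis: 0.7 * ir + 0.3 * vis   # nighttime: favour infrared

ir = np.ones((4, 4))    # stand-in infrared feature map
vis = np.zeros((4, 4))  # stand-in visible feature map
out_day = illumination_gated_moe(ir, vis, 0.9, high_expert, low_expert)
out_night = illumination_gated_moe(ir, vis, 0.1, high_expert, low_expert)
```

Because the night gate shifts weight toward the infrared-favouring expert, `out_night` carries more of the infrared signal than `out_day` does.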
📝 Abstract
While illumination changes inevitably affect the quality of infrared and visible image fusion, many otherwise strong methods still ignore this factor and directly merge the information from the source images, leading to modality bias in the fused results. To this end, we propose a dynamic multi-level image fusion network called MoCTEFuse, which applies an illumination-gated Mixture of Chiral Transformer Experts (MoCTE) to adaptively balance the preservation of texture details and object contrasts. MoCTE consists of high- and low-illumination expert subnetworks, each built upon the Chiral Transformer Fusion Block (CTFB). Guided by the illumination gating signals, the CTFB dynamically switches between the primary and auxiliary modalities and assigns them corresponding weights through its asymmetric cross-attention mechanism. It is also stacked at multiple stages to progressively aggregate and refine modality-specific and cross-modality information. To facilitate robust training, we propose a competitive loss function that integrates illumination distributions with three levels of sub-loss terms. Extensive experiments conducted on the DroneVehicle, MSRS, TNO, and RoadScene datasets show MoCTEFuse's superior fusion performance. Finally, it achieves the best detection mean Average Precision (mAP): 70.93% on the MFNet dataset and 45.14% on the DroneVehicle dataset. The code and model are released at https://github.com/Bitlijinfu/MoCTEFuse.
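The asymmetric cross-attention that the abstract attributes to the CTFB can be illustrated as one-directional attention: the illumination-selected primary modality supplies queries, while the auxiliary modality supplies keys and values. The sketch below assumes this reading; the projection matrices, shapes, and the residual connection are illustrative stand-ins, not the paper's learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_cross_attention(primary, auxiliary, Wq, Wk, Wv):
    """Primary tokens attend to auxiliary tokens (not vice versa)."""
    q = primary @ Wq      # queries from the primary modality, (n_p, d)
    k = auxiliary @ Wk    # keys from the auxiliary modality, (n_a, d)
    v = auxiliary @ Wv    # values from the auxiliary modality, (n_a, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)  # each primary token weighs auxiliary tokens
    return primary + attn @ v        # residual keeps primary content dominant

rng = np.random.default_rng(0)
d = 8
primary = rng.standard_normal((16, d))    # e.g. visible-light tokens by day
auxiliary = rng.standard_normal((16, d))  # infrared tokens as support
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = asymmetric_cross_attention(primary, auxiliary, Wq, Wk, Wv)
```

Swapping which modality plays the primary role (e.g. infrared at night) is what the gating signal controls, giving the "chiral" left/right asymmetry its name.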