🤖 AI Summary
This work addresses two problems: RGB-based vision models degrade significantly under challenging conditions such as nighttime or fog, and high-fidelity infrared-visible fusion methods are too slow for real-time edge deployment. To overcome these issues, we propose FusionProxy, a lightweight, plug-and-play real-time fusion module that, for the first time, distills a diffusion model into a standalone fusion unit that can be integrated into any vision system without joint training. Its core innovation is a dual-variance mechanism computed over an ensemble of teacher samples: per-pixel variance in raw image space weights the pixel-level supervision, while per-pixel variance in the feature space of frozen backbone networks routes feature-level alignment spatially. Experiments show that FusionProxy achieves state-of-the-art performance on static recognition tasks and substantially improves robustness in dynamic scenarios such as closed-loop autonomous driving, while running in real time on hardware ranging from high-end GPUs to commodity devices.
📝 Abstract
Purely RGB-based vision models often fail to provide reliable cues in challenging scenarios such as nighttime and fog, leading to degraded performance and safety risks. Infrared imaging captures heat-emitting sources and provides critical complementary information, but existing high-fidelity fusion methods suffer from prohibitive latency, rendering them impractical for real-time edge deployment. To address this, we propose FusionProxy, a real-time image fusion module designed as a fully independent, plug-and-play component with diffusion-level quality. FusionProxy exploits two complementary statistics of a teacher sample ensemble: per-pixel variance in raw image space, used to weight pixel-level supervision, and per-pixel variance inside frozen foundation backbones, used to route feature-level alignment spatially. Once trained, FusionProxy can be directly integrated into any visual perception system without joint optimization. Extensive experiments demonstrate that our method achieves superior performance on static recognition tasks and significantly enhances robustness in dynamic tasks, including closed-loop autonomous driving. Crucially, FusionProxy achieves real-time inference speeds on diverse platforms, from high-end GPUs to commodity hardware, providing a flexible and generalizable solution for all-day perception.
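The pixel-level half of the dual-variance idea can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how variance maps to weights, so the inverse-variance weighting, the use of the ensemble mean as the target, the L1 distance, and the names `variance_weighted_l1` and `eps` are all assumptions made for illustration.

```python
import numpy as np

def variance_weighted_l1(student, teacher_samples, eps=1e-3):
    """Hypothetical variance-weighted pixel loss.

    student:         (H, W) fused image predicted by the student.
    teacher_samples: (K, H, W) ensemble of K fused images from the teacher.
    Returns (loss, weights), where weights has shape (H, W) and sums to 1.
    """
    # Assumption: the ensemble mean serves as the supervision target.
    target = teacher_samples.mean(axis=0)
    # Per-pixel variance across the K teacher samples.
    var = teacher_samples.var(axis=0)
    # Assumption: pixels where the teacher samples disagree get less weight.
    weights = 1.0 / (var + eps)
    weights = weights / weights.sum()
    loss = float((weights * np.abs(student - target)).sum())
    return loss, weights

# Two teacher samples that agree on the left pixel and disagree on the right.
teacher = np.array([[[0.0, 0.0]], [[0.0, 2.0]]])
student = np.array([[1.0, 1.0]])
loss, weights = variance_weighted_l1(student, teacher)
```

In this toy case the left pixel, where the teacher samples agree, dominates the loss. The feature-level statistic described in the abstract would be computed analogously on activations of a frozen backbone, but how it routes alignment is not specified in this text.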