Adding Thermal Awareness to Visual Systems in Real-Time via Distilled Diffusion Models

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems: RGB-based vision models degrade significantly under challenging conditions such as nighttime or fog, and high-fidelity infrared-visible fusion methods are too slow for real-time edge deployment. To overcome these issues, we propose FusionProxy, a lightweight, plug-and-play real-time fusion module that, for the first time, distills a diffusion model into a standalone fusion unit that can be integrated into any vision system without joint training. Its core innovation is a dual-variance mechanism computed over an ensemble of teacher samples: per-pixel variance in raw image space weights the pixel-level supervision, while per-pixel variance in the feature space of frozen backbone networks routes feature-level alignment spatially. Experiments show that FusionProxy achieves state-of-the-art performance on static recognition tasks and substantially improves robustness in dynamic scenarios such as closed-loop autonomous driving, while running in real time on hardware ranging from high-end GPUs to commodity devices.

📝 Abstract

Purely RGB-based vision models often fail to provide reliable cues in challenging scenarios such as nighttime and fog, leading to degraded performance and safety risks. Infrared imaging captures heat-emitting sources and provides critical complementary information, but existing high-fidelity fusion methods suffer from prohibitive latency, rendering them impractical for real-time edge deployment. To address this, we propose FusionProxy, a real-time image fusion module designed as a fully independent, plug-and-play component with diffusion-level quality. FusionProxy exploits two complementary statistics of a teacher sample ensemble: per-pixel variance in raw image space, used to weight pixel-level supervision, and per-pixel variance inside frozen foundation backbones, used to route feature-level alignment spatially. Once trained, FusionProxy can be directly integrated into any visual perception system without joint optimization. Extensive experiments demonstrate that our method achieves superior performance on static recognition tasks and significantly enhances robustness in dynamic tasks, including closed-loop autonomous driving. Crucially, FusionProxy achieves real-time inference speeds on diverse platforms, from high-end GPUs to commodity hardware, providing a flexible and generalizable solution for all-day perception.
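
The two variance statistics described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how variance is turned into weights or routing decisions, so the inverse-variance weighting, the quantile-based binary mask, and every function name below are assumptions made for illustration only.

```python
import numpy as np


def pixel_variance_weights(teacher_samples, eps=1e-3):
    """Per-pixel mean and weights from K teacher samples of shape (K, H, W).

    Assumption: pixels where the teacher samples agree (low variance)
    receive larger weight. The paper may use a different mapping.
    """
    mean = teacher_samples.mean(axis=0)
    var = teacher_samples.var(axis=0)
    weights = 1.0 / (var + eps)
    weights = weights / weights.mean()  # normalize so the weights average to 1
    return mean, weights


def weighted_pixel_loss(student, teacher_samples):
    """Variance-weighted squared error against the teacher ensemble mean."""
    mean, weights = pixel_variance_weights(teacher_samples)
    return float(np.mean(weights * (student - mean) ** 2))


def feature_routing_mask(teacher_features, quantile=0.5):
    """Binary (H, W) mask from teacher features of shape (K, C, H, W).

    Assumption: feature alignment is applied only at locations whose
    channel-averaged variance across the ensemble is at or below a quantile.
    """
    var = teacher_features.var(axis=0).mean(axis=0)  # (H, W)
    threshold = np.quantile(var, quantile)
    return (var <= threshold).astype(np.float32)


def routed_feature_loss(student_features, teacher_features):
    """Feature alignment loss restricted to the routed spatial locations."""
    mask = feature_routing_mask(teacher_features)
    target = teacher_features.mean(axis=0)                      # (C, H, W)
    per_location = ((student_features - target) ** 2).mean(axis=0)  # (H, W)
    return float((mask * per_location).sum() / max(mask.sum(), 1.0))
```

In this sketch, `teacher_samples` stands in for several fused images drawn from the diffusion teacher for the same input pair, and `teacher_features` for the activations of those images inside a frozen backbone. A student output equal to the ensemble mean yields zero loss in both terms.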

Problem

Research questions and friction points this paper is trying to address.

RGB-based vision

infrared imaging

real-time fusion

edge deployment

perception robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

real-time fusion

thermal-aware vision

distilled diffusion