🤖 AI Summary
This work addresses two weaknesses of existing RGB-Thermal crowd counting methods: the privacy risk of capturing RGB imagery and the difficulty of aligning the two modalities. We propose the first thermal-only crowd counting framework, which uses only thermal input at inference. To strengthen thermal feature representations, we use a depth-to-RGB diffusion model as a cross-modal prior and extract features with single-step Latent Consistency Model (LCM) denoising, which preserves the structural information in the depth conditioning and avoids the error accumulation of multi-step denoising. On the RGBT-CC and DroneRGBT benchmarks, the approach is competitive with state-of-the-art RGB-Thermal fusion methods while removing the need for continuous RGB capture at inference, making it the first privacy-preserving crowd counting system that requires no RGB input.
📝 Abstract
While RGB-Thermal (RGB-T) crowd counting has shown promise, the paradigm faces two critical limitations: RGB data raises privacy concerns in public surveillance, and multi-modal misalignment degrades fusion performance. We propose the first thermal-only framework designed for privacy-conscious crowd counting. It eliminates the RGB dependency at inference time and thereby substantially reduces the privacy exposure associated with continuous RGB capture in public surveillance deployments. To mitigate thermal ambiguity, we use a depth-to-RGB diffusion model as a cross-modal bridge, extracting discriminative features that enhance thermal representations. Critically, we show that single-step Latent Consistency Model (LCM) denoising yields the features most faithful to the structural content of the depth conditioning signal, whereas multi-step denoising progressively decouples features from the conditioning input and accumulates errors that degrade counting accuracy. Experiments on the RGBT-CC and DroneRGBT datasets show that our method is competitive with state-of-the-art RGB-T fusion methods while requiring only thermal input during inference. The code will be made publicly available.
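
To make the single-step versus multi-step contrast concrete, the standard consistency-model sampling recurrences can be written out as below. This is a sketch in assumed notation, not the paper's own formulation: the symbols f_\theta (consistency function), c (conditioning signal), z_T (initial noisy latent), \alpha and \sigma (noise schedule), and the timesteps T > \tau_1 > ... > \tau_{N-1} are introduced here for illustration only. The abstract does not specify how the thermal input enters the conditioning path, and that is left open here as well.

```latex
% Assumed notation (not taken from the source):
%   f_\theta : consistency function, c : conditioning signal,
%   z_T : initial noisy latent, \alpha(\cdot), \sigma(\cdot) : noise schedule,
%   T > \tau_1 > \dots > \tau_{N-1} : decreasing timesteps.
\begin{align}
  \text{single-step:}\quad
    & \hat{z}_0 = f_\theta(z_T, c, T) \\
  \text{multi-step:}\quad
    & \hat{z}_0^{(0)} = f_\theta(z_T, c, T), \\
    & z_{\tau_n} = \alpha(\tau_n)\,\hat{z}_0^{(n-1)} + \sigma(\tau_n)\,\epsilon_n,
      \qquad \epsilon_n \sim \mathcal{N}(0, I), \\
    & \hat{z}_0^{(n)} = f_\theta(z_{\tau_n}, c, \tau_n),
      \qquad n = 1, \dots, N-1
\end{align}
```

In the multi-step form, every iteration injects fresh noise \epsilon_n and re-estimates the latent from it. That is one plausible place for the decoupling from the conditioning input and the error accumulation described in the abstract to arise, though the abstract itself does not spell out the mechanism.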