🤖 AI Summary
This work addresses the sharp performance degradation of conventional RGB-based object detection under extremely low-light conditions. The authors propose AdaFuse-Det, a dual-stream fusion framework that integrates CLAHE-enhanced RGB images with voxelized event data. Central to the approach is an adaptive cross-modal attention mechanism grounded in minimum-variance linear estimation theory, whose learned attention map asymptotically recovers the Gauss–Markov optimal fusion weights. The study also establishes event-conservation and temporal-resolution bounds for the voxelization stage. Evaluated on the LLE-VOS benchmark under severe illumination degradation, the method achieves 65.54% recall, 53.85% precision, and a 59.12% F1-score, outperforming single-modality detectors in recall and behaving as the theory predicts for illumination adaptation.
📝 Abstract
Detecting objects reliably under extreme low-light conditions is an open problem in computer vision, with practical urgency in applications ranging from nighttime surveillance to search-and-rescue robotics. Conventional RGB cameras degrade sharply at low photon flux, whereas event cameras, which record asynchronous per-pixel brightness changes at microsecond resolution and with high dynamic range, provide complementary structural cues that are largely illumination-invariant. We present AdaFuse-Det, a dual-stream framework that fuses CLAHE-enhanced RGB frames with voxelized event tensors through an Adaptive Cross-Modal Fusion (ACMF) module grounded in minimum-variance linear estimation theory. We formally show that the learned attention map asymptotically recovers the Gauss–Markov optimal fusion weights, and we establish event-conservation and temporal-resolution bounds for the voxelization stage. On the LLE-VOS benchmark, AdaFuse-Det achieves a Recall of $65.54\%$, Precision of $53.85\%$, and F1-Score of $59.12\%$ under severe illumination degradation, outperforming single-modality detectors in recall by a margin consistent with the theoretically predicted illumination-adaptive behavior.
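
To make the minimum-variance idea concrete, the sketch below computes the classical inverse-variance (Gauss–Markov) weights for two estimates of the same quantity. It is not the ACMF module itself, which is a learned attention map. It only illustrates the closed-form target that such a map is said to approach, under the added assumptions that both streams give unbiased estimates with uncorrelated noise and known per-element variances. The function names and the variance inputs are hypothetical and do not come from the paper.

```python
import numpy as np


def min_variance_weights(var_rgb, var_evt, eps=1e-12):
    """Inverse-variance fusion weights for two unbiased, uncorrelated estimates.

    Assumption (not from the paper): per-element noise variances are known.
    The returned weights are non-negative and sum to 1 element-wise.
    """
    prec_rgb = 1.0 / (np.asarray(var_rgb, dtype=float) + eps)  # precision of RGB stream
    prec_evt = 1.0 / (np.asarray(var_evt, dtype=float) + eps)  # precision of event stream
    total = prec_rgb + prec_evt
    return prec_rgb / total, prec_evt / total


def fuse(feat_rgb, feat_evt, var_rgb, var_evt):
    """Fuse two feature maps with the weights above and report the fused variance."""
    w_rgb, w_evt = min_variance_weights(var_rgb, var_evt)
    fused = w_rgb * feat_rgb + w_evt * feat_evt
    # Variance of a weighted sum of uncorrelated terms.
    fused_var = w_rgb**2 * np.asarray(var_rgb, dtype=float) \
        + w_evt**2 * np.asarray(var_evt, dtype=float)
    return fused, fused_var, w_rgb, w_evt


if __name__ == "__main__":
    # Hypothetical low-light case: the RGB estimate is 9x noisier than the event one.
    feat_rgb = np.array([2.0])
    feat_evt = np.array([4.0])
    fused, fused_var, w_rgb, w_evt = fuse(feat_rgb, feat_evt, var_rgb=9.0, var_evt=1.0)
    print("weights:", w_rgb, w_evt)
    print("fused value:", fused, "fused variance:", fused_var)
```

In this hypothetical low-light setting the weights shift toward the event stream (roughly 0.1 for RGB and 0.9 for events), and the fused variance is lower than that of either input. As the RGB variance falls in brighter scenes, the same formula moves weight back toward the RGB stream, which is the illumination-adaptive behavior the abstract describes.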