🤖 AI Summary
Traditional image signal processing (ISP) pipelines convert RAW sensor data to RGB images for human vision, discarding low-level information critical for computer vision tasks such as object detection. This work introduces a RAW-domain object detection paradigm that abandons fixed ISP pipelines in favor of a learnable Raw Adaptation Module (RAM). Inspired by human visual processing, RAM employs a parallel multi-path architecture that integrates attention-guided dynamic feature fusion and end-to-end joint optimization to enable task-driven, adaptive RAW preprocessing. Experiments show that the approach outperforms conventional RGB-based detectors across multiple RAW benchmarks, particularly under challenging low-light and high-dynamic-range conditions, achieving state-of-the-art performance in RAW-domain object detection.
📝 Abstract
Object detection models are typically applied to standard RGB images processed through Image Signal Processing (ISP) pipelines, which are designed to enhance sensor-captured RAW images for human vision. However, these ISP functions can discard information that is critical for computer vision tasks such as object detection. In this work, we introduce the Raw Adaptation Module (RAM), a module designed to replace the traditional ISP, with parameters optimized specifically for RAW object detection. Inspired by the parallel processing mechanisms of the human visual system, RAM departs from existing learned ISP methods by applying multiple ISP functions in parallel rather than sequentially, allowing for a more comprehensive capture of image features. These processed representations are then fused in a specialized module, which dynamically integrates and optimizes the information for the target task. This approach not only leverages the full potential of RAW sensor data but also enables task-specific pre-processing, resulting in superior object detection performance. Our approach outperforms RGB-based methods and achieves state-of-the-art results across diverse RAW image datasets under varying lighting conditions and dynamic ranges.
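The parallel-then-fuse idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the branch functions (`identity`, `gain`, `gamma`), the `fuse_parallel` helper, and the softmax-weighted fusion are hypothetical stand-ins, and the hand-set `scores` take the place of parameters that RAM would learn jointly with the detector.

```python
import math
from functools import partial

# Hypothetical ISP-like branch functions operating on a RAW image stored as
# a list of rows with values normalized to [0, 1]. These are illustrative
# choices, not the specific functions used by RAM.
def identity(raw):
    return [row[:] for row in raw]

def gain(raw, g):
    # Multiply every pixel by g and clip to the valid range.
    return [[min(1.0, p * g) for p in row] for row in raw]

def gamma(raw, exponent):
    # Apply a power-law tone curve to every pixel.
    return [[p ** exponent for p in row] for row in raw]

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_parallel(raw, branches, scores):
    """Apply every branch to the same RAW input (in parallel, not chained),
    then fuse the outputs with softmax weights derived from `scores`.
    Assumption: `scores` stands in for learned, input-dependent attention."""
    weights = softmax(scores)
    outputs = [branch(raw) for branch in branches]
    height, width = len(raw), len(raw[0])
    fused = [
        [sum(w * out[i][j] for w, out in zip(weights, outputs))
         for j in range(width)]
        for i in range(height)
    ]
    return fused, weights

# Toy usage: three branches see the same 2x2 RAW patch.
raw_patch = [[0.0, 0.25], [0.5, 1.0]]
branches = [identity, partial(gain, g=2.0), partial(gamma, exponent=0.5)]
fused_patch, branch_weights = fuse_parallel(raw_patch, branches, [0.0, 0.0, 0.0])
```

The point of the sketch is the contrast with a sequential ISP: each branch receives the unmodified RAW input, so no branch loses information discarded by an earlier stage, and the fusion step decides how much each representation contributes.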