🤖 AI Summary
This work addresses the limited generalization of visual perception models under domain shifts—such as those caused by day-night variations—by proposing PEPR, a method grounded in the Learning Using Privileged Information (LUPI) paradigm. PEPR leverages event camera data, available only during training, as privileged information to enhance robustness. By introducing a prediction-based regularization mechanism in a shared latent space, it reformulates cross-modal alignment as a predictive task rather than direct feature alignment, thereby preserving the semantic richness of RGB inputs while incorporating the domain-invariant characteristics of event streams. Experiments show that the resulting single-modality RGB model significantly outperforms existing alignment-based baselines on object detection and semantic segmentation under challenging domain shifts, including extreme day-night transitions, demonstrating superior domain generalization.
📝 Abstract
Deep neural networks for visual perception are highly susceptible to domain shift, which poses a critical challenge for real-world deployment under conditions that differ from the training data. To address this domain generalization challenge, we propose a cross-modal framework under the learning using privileged information (LUPI) paradigm for training a robust, single-modality RGB model. We leverage event cameras as a source of privileged information, available only during training. The two modalities exhibit complementary characteristics: the RGB stream is semantically dense but domain-dependent, whereas the event stream is sparse yet more domain-invariant. Direct feature alignment between them is therefore suboptimal, as it forces the RGB encoder to mimic the sparse event representation, thereby losing semantic detail. To overcome this, we introduce Privileged Event-based Predictive Regularization (PEPR), which reframes LUPI as a predictive problem in a shared latent space. Instead of enforcing direct cross-modal alignment, we train the RGB encoder with PEPR to predict event-based latent features, distilling robustness without sacrificing semantic richness. The resulting standalone RGB model consistently improves robustness to day-to-night and other domain shifts, outperforming alignment-based baselines across object detection and semantic segmentation.
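The core idea above — regularizing the RGB encoder by predicting privileged event-based latents rather than aligning to them directly — can be sketched as a training-time auxiliary loss. This is a minimal illustrative sketch, not the paper's implementation: the predictor architecture, the MSE objective, the stop-gradient on the event branch, and all names (`PredictiveRegularizer`, `z_rgb`, `z_event`) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictiveRegularizer(nn.Module):
    """Hypothetical sketch of a PEPR-style auxiliary loss: a small head
    predicts event-branch latents from RGB latents in a shared space."""

    def __init__(self, dim):
        super().__init__()
        # A lightweight MLP predictor; the actual head in the paper may differ.
        self.predictor = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, z_rgb, z_event):
        pred = self.predictor(z_rgb)
        # Predict the event latent from the RGB latent. Detaching the event
        # target means only the RGB branch is regularized, so the RGB encoder
        # is not forced to collapse onto the sparse event representation.
        return F.mse_loss(pred, z_event.detach())

# Toy usage with random latent features (batch of 8, dim 64).
torch.manual_seed(0)
reg = PredictiveRegularizer(dim=64)
z_rgb = torch.randn(8, 64)
z_event = torch.randn(8, 64)
loss = reg(z_rgb, z_event)  # scalar loss, added to the task loss at train time
```

At inference, only the RGB encoder is kept; the event branch and the predictor head exist solely to shape the RGB latent space during training.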