🤖 AI Summary
This work addresses the robustness evaluation of bird's-eye-view (BEV) 3D object detectors by proposing the first general-purpose method for generating spatiotemporally consistent, physically realizable 3D adversarial objects. To meet the practical requirement of persistent multi-view, multi-frame attacks, the method combines an occlusion-aware differentiable rendering module with BEV feature-space constraints and multi-view geometric consistency modeling, ensuring stable interference and visual plausibility across views and frames. The generated non-intrusive 3D perturbations (e.g., vehicle-mounted thin films) significantly degrade vehicle detection on mainstream BEV detectors, achieving higher attack success rates than existing baselines. The method also generalizes well: adversarial objects remain effective under varying positions, distances, and viewing angles without re-optimization. This establishes a new benchmark for physically grounded, temporally coherent adversarial evaluation of BEV perception systems.
📝 Abstract
3D object detection is a critical component of autonomous driving systems, enabling real-time recognition and detection of vehicles, pedestrians, and obstacles under varying environmental conditions. Among existing methods, 3D object detection in the Bird's Eye View (BEV) has emerged as the mainstream framework. To guarantee safe, robust, and trustworthy 3D object detection, 3D adversarial attacks are investigated, in which perturbations are placed in the 3D environment to evaluate model performance, e.g., applying a film to a car or dressing a pedestrian in adversarial clothing. The vulnerability of 3D object detection models to such attacks is an important indicator of their robustness against perturbations. To investigate this vulnerability, we generate non-invasive 3D adversarial objects tailored to real-world attack scenarios. Our method verifies the existence of universal adversarial objects that remain spatially consistent across time and camera views. Specifically, we employ differentiable rendering techniques to accurately model the spatial relationship between the adversarial object and the target vehicle. Furthermore, we introduce an occlusion-aware module to enhance visual consistency and realism under different viewpoints. To maintain attack effectiveness across multiple frames, we design a BEV spatial feature-guided optimization strategy. Experimental results demonstrate that our approach can reliably suppress vehicle predictions from state-of-the-art 3D object detectors, serving as an important tool for testing the robustness of 3D object detection models before deployment. Moreover, the generated adversarial objects exhibit strong generalization, retaining their effectiveness at various positions and distances in the scene.
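The overall attack pipeline described above can be sketched in miniature: a bounded texture perturbation is pushed through a differentiable rendering step with per-view occlusion masking, and optimized by gradient descent to suppress a detector's confidence across all views at once. This minimal NumPy sketch uses a toy linear "renderer" and a toy sigmoid detection head as stand-ins for the paper's actual renderer and BEV detector; every name, shape, and constant below is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multi-view setup (all quantities are illustrative stand-ins).
n_views, d_tex, d_img = 4, 16, 32
base_views = 0.1 * rng.normal(size=(n_views, d_img))    # clean per-view renders
render_mats = rng.normal(size=(n_views, d_img, d_tex))  # linear "differentiable renderer"
occlusion = (rng.random((n_views, d_img)) > 0.2)        # 1 = texture pixel visible in view
w, b = rng.normal(size=d_img), 2.0                      # toy detector confidence head

def detect_score(delta):
    """Per-view detector confidence after applying texture perturbation delta."""
    imgs = base_views + (render_mats @ delta) * occlusion  # render + occlusion mask
    return 1.0 / (1.0 + np.exp(-(imgs @ w + b)))           # sigmoid confidence

# Optimize delta to SUPPRESS detection jointly over all views, keeping the
# perturbation bounded (a crude stand-in for physical realizability).
delta, eps, lr = np.zeros(d_tex), 0.5, 0.1
for _ in range(300):
    s = detect_score(delta)                              # shape (n_views,)
    # d(mean score)/d(delta): sigmoid'(z) = s*(1-s), chained through the renderer
    grad = np.einsum("v,vi,vij->j",
                     s * (1 - s), occlusion * w, render_mats) / n_views
    delta = np.clip(delta - lr * grad, -eps, eps)        # projected gradient descent

clean_score = detect_score(np.zeros(d_tex)).mean()
adv_score = detect_score(delta).mean()
```

Because the same `delta` is optimized against several views simultaneously, the suppression it achieves is shared across viewpoints rather than tuned to one camera, which is the core idea behind the multi-view consistency objective (the real method additionally constrains BEV features and enforces geometric consistency across frames).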