🤖 AI Summary
This work addresses a security vulnerability in multi-sensor fusion systems for autonomous driving, which assume that agreement across modalities indicates a real object and are therefore susceptible to coordinated spoofing. The authors propose an attack, simulated at the data level, that simultaneously generates visually plausible image patches and spatially aligned synthetic LiDAR point clusters, creating consistent false objects across the camera and LiDAR modalities. Because both sensors produce matching erroneous detections, the attack evades the redundancy checks built into fusion pipelines. Simulations on 400 scenes from the KITTI dataset yield an 85.5% attack success rate, exposing a flaw in the cross-modal consistency logic underpinning current perception systems.
📝 Abstract
Autonomous Vehicles (AVs) increasingly depend on Multi-Sensor Fusion (MSF) to combine complementary modalities such as cameras and LiDAR for robust perception. While this redundancy is intended to safeguard against single-sensor failures, the fusion process itself introduces a subtle and underexplored vulnerability. In this work, we investigate whether an attacker can bypass MSF's redundancy by fabricating cross-sensor consistency, making multiple sensors agree on the same false object. We design a coordinated, data-level (early-fusion) attack that emulates the outcome of two synchronized physical spoofing sources: an infrared (IR) projection that induces a false camera detection and a LiDAR signal injection that produces a matching 3D point cluster. Rather than implementing the physical attack hardware, we simulate its sensor-level outcomes by inserting perspective-aware image patches and synthetic LiDAR point clusters aligned in 3D space. This approach preserves the perceptual effects that real spoofing based on IR projection and intentional electromagnetic interference (IEMI) would create at the sensor output. In a large-scale evaluation on 400 KITTI scenes, the coordinated spoofing deceives a state-of-the-art perception model with an 85.5% attack success rate. These findings provide the first quantitative evidence that malicious cross-modal consistency can compromise MSF-based perception, revealing a critical vulnerability in the core data-fusion logic of modern autonomous vehicle systems.
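The cross-modal alignment described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`project`, `fake_cluster`, `patch_bbox`), the pinhole intrinsics, and the box dimensions are all hypothetical, and sampling points uniformly inside a box is a simplification (a real LiDAR return would lie on visible surfaces only). The sketch places a fake object at one 3D location in the camera frame, samples a synthetic point cluster there, and computes the image region where a matching patch would have to be pasted so that both modalities agree.

```python
import random

# Hypothetical pinhole intrinsics in pixels (illustrative, not from the paper).
FX, FY, CX, CY = 720.0, 720.0, 620.0, 190.0


def project(point):
    """Project a camera-frame 3D point (x right, y down, z forward) to pixels."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (FX * x / z + CX, FY * y / z + CY)


def fake_cluster(center, size, n, seed=0):
    """Sample n synthetic points uniformly inside an axis-aligned box.

    Assumption: uniform volume sampling stands in for a spoofed LiDAR return.
    """
    rng = random.Random(seed)
    return [
        tuple(c + rng.uniform(-s / 2, s / 2) for c, s in zip(center, size))
        for _ in range(n)
    ]


def patch_bbox(center, size):
    """Return (u_min, v_min, u_max, v_max) of the box's projected corners.

    This is the image region a perspective-aware patch would need to cover.
    """
    corners = [
        (
            center[0] + dx * size[0] / 2,
            center[1] + dy * size[1] / 2,
            center[2] + dz * size[2] / 2,
        )
        for dx in (-1, 1)
        for dy in (-1, 1)
        for dz in (-1, 1)
    ]
    us, vs = zip(*(project(c) for c in corners))
    return min(us), min(vs), max(us), max(vs)


if __name__ == "__main__":
    # Hypothetical fake object: 15 m ahead, 2 m to the right, car-sized box.
    center, size = (2.0, 0.5, 15.0), (1.8, 1.5, 4.0)
    points = fake_cluster(center, size, n=200)
    print("patch region (pixels):", patch_bbox(center, size))
    print("first synthetic point:", points[0])
```

Because every sampled point lies inside the box, its projection falls inside the patch region, which is the kind of agreement between camera and LiDAR that the attack fabricates. A real pipeline would also need the LiDAR-to-camera extrinsic transform, which this sketch sidesteps by working directly in the camera frame.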