🤖 AI Summary
This work identifies a novel security vulnerability in generative world models used to synthesize controllable driving videos: physical-condition channels, such as those encoding high-definition maps or 3D bounding boxes, are susceptible to adversarial manipulation. To exploit this vulnerability, the authors propose PhysCond-WMA, the first white-box adversarial attack tailored to world models. The method employs a two-stage optimization strategy: a quality-preserving guidance stage that constrains the reverse-diffusion loss below a calibrated threshold, followed by a momentum-guided denoising stage that accumulates gradients aligned with the attack objective along the denoising trajectory. While maintaining high visual fidelity (only a 9% increase in FID and a 3.9% rise in FVD), the attack achieves a 55% success rate in steering model outputs toward targeted manipulations. Critically, it degrades downstream 3D object detection performance by about 4% and open-loop planning accuracy by about 20%, systematically exposing and quantifying previously uncharacterized safety risks in generative world models.
📝 Abstract
Generative world models (WMs) are increasingly used to synthesize controllable, sensor-conditioned driving videos, yet their reliance on physical priors exposes novel attack surfaces. In this paper, we present the Physical-Conditioned World Model Attack (PhysCond-WMA), the first white-box world model attack that perturbs physical-condition channels, such as HDMap embeddings and 3D-box features, to induce semantic-, logic-, or decision-level distortions while preserving perceptual fidelity. PhysCond-WMA is optimized in two stages: (1) a quality-preserving guidance stage that constrains the reverse-diffusion loss below a calibrated threshold, and (2) a momentum-guided denoising stage that accumulates target-aligned gradients along the denoising trajectory for stable, temporally coherent semantic shifts. Extensive experiments demonstrate that our approach remains effective while increasing FID by only about 9% and FVD by about 3.9% on average. Under the targeted attack setting, the attack success rate (ASR) reaches 0.55. Downstream studies further show tangible risk: training on attacked videos decreases 3D detection performance by about 4% and worsens open-loop planning performance by about 20%. These findings reveal and quantify, for the first time, security vulnerabilities in generative world models, motivating the development of more comprehensive security checks.
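To make the two-stage optimization concrete, here is a minimal PyTorch-style sketch under stated assumptions: the closures `recon_loss` (a proxy for the reverse-diffusion loss, assumed to run the model's reverse process internally) and `target_loss` (the attack objective), the threshold `tau`, and the MI-FGSM-style momentum update are illustrative choices, not the paper's exact formulation, and the two stages are interleaved per iteration for brevity.

```python
import torch

def physcond_attack(cond, recon_loss, target_loss,
                    steps=50, eps=0.05, alpha=0.005, mu=0.9, tau=0.1):
    """Illustrative sketch of a two-stage physical-condition attack.

    cond:        physical-condition tensor (e.g., HDMap / 3D-box embeddings).
    recon_loss:  callable, perturbed cond -> scalar reverse-diffusion loss proxy.
    target_loss: callable, perturbed cond -> scalar attack objective.
    All hyperparameters here are assumed values for illustration.
    """
    delta = torch.zeros_like(cond, requires_grad=True)  # perturbation
    g = torch.zeros_like(cond)                          # momentum buffer

    for _ in range(steps):
        adv = cond + delta

        l_q = recon_loss(adv)
        if l_q.item() > tau:
            # Stage 1: quality-preserving guidance -- push the
            # reverse-diffusion loss back below the calibrated threshold.
            grad = torch.autograd.grad(l_q, delta)[0]
            with torch.no_grad():
                delta -= alpha * grad.sign()
        else:
            # Stage 2: momentum-guided denoising -- accumulate
            # target-aligned gradients along the denoising trajectory
            # (MI-FGSM-style normalized momentum update).
            l_t = target_loss(adv)
            grad = torch.autograd.grad(l_t, delta)[0]
            with torch.no_grad():
                g = mu * g + grad / grad.abs().mean().clamp_min(1e-12)
                delta += alpha * g.sign()

        # Keep the perturbation small so perceptual fidelity is preserved.
        with torch.no_grad():
            delta.clamp_(-eps, eps)

    return (cond + delta).detach()
```

The key design point the sketch captures is the ordering constraint: the target-aligned momentum update is applied only while the reverse-diffusion loss stays under the threshold, which is what lets the attack shift semantics without visibly degrading the generated video.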