🤖 AI Summary
This work addresses the limited transferability of existing physical adversarial attacks, which often overfit to a single surrogate model and a single optimization objective. Conventional ensemble methods help, but they suffer from gradient conflicts within constrained physical texture spaces, which degrades cross-model generalization. The authors propose a Joint Multi-Objective and Multi-Model Optimization Framework (JMOF) that selects a surrogate model ensemble through quantitative similarity analysis. JMOF employs a dual-level mechanism that jointly suppresses output predictions and flattens intermediate feature distributions, and it introduces an Orthogonal Gradient Alignment (OGA) strategy to mitigate cross-model gradient conflicts. In simulated and real-world experiments, JMOF outperforms state-of-the-art baselines against diverse black-box detectors. It also generalizes across vision tasks, producing attacks that simultaneously deceive object detection models together with semantic segmentation or monocular depth estimation models. The authors present JMOF as a framework for evaluating the vulnerability of deployed vision systems.
📝 Abstract
Physical adversarial attacks often overfit to a single surrogate model and a single optimization objective. While ensemble attacks can mitigate this, existing methods struggle with severe gradient conflicts within restricted physical texture spaces, which significantly degrades cross-model transferability. To bridge this gap, this paper proposes a Joint Multi-Objective and Multi-Model Optimization Framework (JMOF) that leverages quantitative similarity analysis to select the optimal surrogate model ensemble. Within JMOF, a dual-level mechanism jointly suppresses prediction outputs and flattens intermediate feature distributions, balancing attack efficiency with deep generalization. Additionally, an Orthogonal Gradient Alignment (OGA) strategy resolves cross-model gradient conflicts, transforming mutually repulsive gradients into synergistic optimization directions. Extensive simulated and real-world experiments demonstrate that JMOF outperforms state-of-the-art baselines against diverse black-box detectors. Crucially, JMOF exhibits substantial cross-vision-task generalization, generating attacks capable of simultaneously deceiving object detection models together with semantic segmentation or monocular depth estimation models. This research advances the generalization limits of physical adversarial attacks, providing a robust framework for evaluating visual AI vulnerabilities in real-world deployments.
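
The abstract does not spell out how OGA is computed, so the following is only an illustrative sketch of the general idea of resolving cross-model gradient conflicts. It swaps in a well-known projection scheme (PCGrad-style): when two surrogate gradients have a negative inner product, the conflicting component is projected out before averaging. The function names `resolve_conflicts` and `combine`, and the toy two-dimensional "texture gradients", are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def resolve_conflicts(grads: Sequence[Sequence[float]]) -> List[Vector]:
    """Hypothetical helper: PCGrad-style projection, NOT the paper's OGA.

    For each surrogate gradient, remove the component that points against
    any other surrogate's gradient (negative inner product).
    """
    adjusted: List[Vector] = []
    for i, g in enumerate(grads):
        g_i = list(g)
        for j, other in enumerate(grads):
            if i == j:
                continue
            overlap = dot(g_i, other)
            norm_sq = dot(other, other)
            if overlap < 0.0 and norm_sq > 0.0:
                # Project g_i onto the plane orthogonal to the other gradient.
                scale = overlap / norm_sq
                g_i = [x - scale * y for x, y in zip(g_i, other)]
        adjusted.append(g_i)
    return adjusted


def combine(grads: Sequence[Sequence[float]]) -> Vector:
    """Average the conflict-free gradients into one update direction."""
    adjusted = resolve_conflicts(grads)
    n = len(adjusted)
    return [sum(column) / n for column in zip(*adjusted)]


# Toy example: two surrogate models whose gradients partially oppose each other.
g_model_a = [1.0, 0.0]
g_model_b = [-1.0, 1.0]

naive = [(a + b) / 2 for a, b in zip(g_model_a, g_model_b)]
aligned = combine([g_model_a, g_model_b])

print("naive mean:  ", naive, "progress on A:", dot(naive, g_model_a))
print("aligned mean:", aligned, "progress on A:", dot(aligned, g_model_a))
```

In this toy case the naive average makes no progress along model A's gradient, while the projected average has a non-negative inner product with both original gradients. That is the kind of behavior the abstract describes as turning mutually repulsive gradients into a shared direction, although the paper's actual alignment rule may differ.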