WMAttack: Automated Attack Search for Adversarial Evaluation of World-Model Agents

📅 2026-05-22

🤖 AI Summary

Existing methods struggle to evaluate the adversarial robustness of world-model agents both accurately and efficiently: weak manually tuned attacks tend to overestimate robustness, while exhaustive search is infeasible because every candidate requires costly closed-loop rollouts. This work proposes WMAttack, a framework that formulates adversarial evaluation as a budget-constrained search over attack configurations. It introduces Self-Correcting Attack Search (SCAS), which refines the attack proposal distribution during the search, and Representation-Guided Attack Retrieval (RGAR), which transfers attack configurations across tasks. SCAS draws on four feedback signals (reward degradation, action instability, runtime overhead, and rollout variability), while RGAR uses task representation similarity to reuse historically effective configurations. On Atari and DeepMind Control (DMC) benchmarks, WMAttack finds stronger attacks than the evaluated baselines, raising DreamerV3’s normalized reward drop from 0.497 to 1.034 on Atari and from 0.319 to 0.682 on DMC.
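The retrieval idea in the summary can be illustrated with a minimal sketch. It is not the paper's RGAR implementation: the summary does not say which similarity metric or task representation is used, so cosine similarity over small hand-written embeddings is an assumption here, and the function names, attack family labels, and stored utilities are all hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors (assumed metric).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_warm_start(task_embedding, history, k=2):
    # history: list of (task_embedding, attack_config, observed_utility) tuples.
    # Returns the configs from the k most representation-similar past tasks.
    ranked = sorted(history, key=lambda h: cosine(task_embedding, h[0]), reverse=True)
    return [cfg for _, cfg, _ in ranked[:k]]

# Hypothetical history of attacks that worked on earlier tasks.
history = [
    ([1.0, 0.0], {"family": "pgd", "eps": 0.03}, 0.9),
    ([0.9, 0.1], {"family": "fgsm", "eps": 0.05}, 0.6),
    ([0.0, 1.0], {"family": "noise", "eps": 0.10}, 0.4),
]

# A new task whose embedding is closest to the first two entries.
warm = retrieve_warm_start([1.0, 0.0], history, k=2)
print(warm)
```

The retrieved configurations would only seed the search; they still need to be evaluated on the new task, since similarity in representation does not guarantee that an attack transfers.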

📝 Abstract

Despite the growing use of world models as decision-making agents, their adversarial robustness remains underexplored due to the lack of dedicated automated evaluation methods. A key obstacle is that attack evaluation must be both accurate and efficient: weak manually tuned attacks can overestimate robustness, while exhaustive hyperparameter search is prohibitively expensive because each candidate requires closed-loop rollouts through learned latent dynamics. We introduce WMAttack, an automated attack-search framework for adversarial evaluation of world-model agents. WMAttack formulates robustness evaluation as a finite-budget search over attack configurations, including attack families, perturbation budgets, optimization steps, restarts, and allocation rules. To improve search accuracy, Self-Correcting Attack Search (SCAS) refines the attack proposal distribution using feedback from reward degradation, action instability, runtime cost, and rollout variability. To improve search efficiency, Representation-Guided Attack Retrieval (RGAR) retrieves effective historical configurations from representation-similar tasks, providing a warm start for unseen environments. We provide a theoretical explanation showing that proposal refinement improves finite-budget search when it shifts probability mass toward high-utility attacks. Across Atari and DeepMind Control tasks, WMAttack consistently discovers stronger attacks than the evaluated baselines, improving normalized reward drop from 0.497 to 1.034 on DreamerV3 Atari and from 0.319 to 0.682 on DMC. Ablations further show that RGAR improves initial candidate quality and SCAS improves final attack utility under fixed evaluation budgets.
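A rough sketch of finite-budget search with a refined proposal distribution is given below. It is an illustration under stated assumptions, not the authors' SCAS algorithm: the abstract does not specify the update rule, so a softmax over running mean utilities stands in for it. The utility weights, the `toy_rollout` stub (which replaces real closed-loop rollouts), and the candidate configurations are all hypothetical.

```python
import math
import random

def utility(fb, w_drop=1.0, w_act=0.2, w_time=0.1, w_var=0.1):
    # Hypothetical scalarization of the four feedback signals named in the abstract:
    # reward degradation and action instability count for the attack,
    # runtime cost and rollout variability count against it.
    return (w_drop * fb["reward_drop"] + w_act * fb["action_instability"]
            - w_time * fb["runtime"] - w_var * fb["rollout_std"])

def proposal(means, temp):
    # Softmax over running mean utilities, shifted by the max for numerical stability.
    m = max(means)
    weights = [math.exp((x - m) / temp) for x in means]
    total = sum(weights)
    return [w / total for w in weights]

def budgeted_search(configs, evaluate, budget, temp=0.2, seed=0):
    rng = random.Random(seed)
    means = [0.0] * len(configs)   # running mean utility per configuration
    counts = [0] * len(configs)
    best_cfg, best_u = None, -math.inf
    for _ in range(budget):        # each iteration spends one evaluation
        probs = proposal(means, temp)
        i = rng.choices(range(len(configs)), weights=probs)[0]
        u = utility(evaluate(configs[i]))
        counts[i] += 1
        means[i] += (u - means[i]) / counts[i]   # incremental mean update
        if u > best_u:
            best_cfg, best_u = configs[i], u
    return best_cfg, best_u, proposal(means, temp)

def toy_rollout(cfg):
    # Stand-in for an expensive closed-loop rollout; values are made up.
    return {"reward_drop": cfg["eps"] * cfg["steps"],
            "action_instability": 0.1,
            "runtime": 0.05 * cfg["steps"],
            "rollout_std": 0.01}

configs = [
    {"family": "fgsm", "eps": 0.03, "steps": 1},
    {"family": "pgd", "eps": 0.03, "steps": 10},
    {"family": "pgd", "eps": 0.05, "steps": 10},
]
best_cfg, best_u, final_probs = budgeted_search(configs, toy_rollout, budget=30)
print(best_cfg, round(best_u, 3), [round(p, 3) for p in final_probs])
```

As configurations are evaluated, the proposal places more probability on those with higher observed mean utility, which is the mass-shifting behavior the abstract's theoretical argument relies on. A warm start could be modeled by initializing `means` from retrieved historical utilities instead of zeros.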

Problem

Research questions and friction points this paper is trying to address.

adversarial robustness

world-model agents

automated attack search

evaluation efficiency

attack configuration

Innovation

Methods, ideas, or system contributions that make the work stand out.

WMAttack

adversarial robustness

world-model agents