TRAP: Tail-aware Ranking Attack for World-Model Planning

📅 2026-05-03
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work identifies a security vulnerability in world models used for long-horizon planning: the ranking of imagined trajectories that drives planning is susceptible to backdoor attacks, even though conventional backdoor attacks that target local features or one-step predictions are often ineffective against such models. The study finds that trajectory rankings are long-tailed, so a few decision-critical trajectories determine the planning outcome, and introduces TRAP, a tail-aware backdoor attack framework. By perturbing the ranking of these critical imagined trajectories under trigger conditions, the method hijacks planning outcomes while largely preserving normal behavior on clean inputs. The approach combines a tail-aware ranking loss with dual gating mechanisms that stabilize optimization and regulate when and where the attack penalty is applied. Experiments on DreamerV3 and TD-MPC2 show sustained behavioral deviations and significant performance degradation, exposing a threat to the safety and reliability of world-model-based planning systems.
πŸ“ Abstract
World models enable long-horizon planning by internally generating and evaluating imagined trajectories, making them a promising foundation for generalist agents. However, this imagination-driven decision process also introduces new security risks. Existing backdoor attacks typically aim to manipulate local features, one-step predictions, or instantaneous policy outputs. While such objectives may suffice for weaker reactive models, they are often ineffective against world models, where the learned dynamics prior and planning process can absorb or wash out the effects of shallow perturbations. More importantly, we find that world models exhibit a distinct backdoor vulnerability rooted in the long-tailed ranking structure of imagined trajectories, where disrupting the ordering of a few decision-critical trajectories can systematically hijack planning. To exploit this vulnerability, we propose TRAP, a backdoor attack framework for world models that targets imagined trajectory ranking. TRAP combines a tail-aware ranking loss to focus optimization on decision-critical trajectories with dual gating mechanisms that stabilize optimization and regulate when and where the attack penalty is applied. Under trigger conditions, TRAP alters the relative ranking of imagined trajectories to redirect planning outcomes, while largely maintaining the normal ranking structure on clean inputs. Experiments on DreamerV3 and TD-MPC2 across diverse tasks show that TRAP consistently induces sustained behavioral deviations and significant performance degradation, highlighting the need for dedicated security evaluation of world-model-based agents.
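To make the ranking idea concrete, here is a minimal Python sketch of a planner that acts on the top-ranked imagined trajectory, together with a weighted pairwise ranking loss. It is an illustration only, not the paper's formulation. The function names (`select_action`, `head_weights`, `weighted_ranking_loss`), the softmax weighting, the hinge margin, and the toy scores are all assumptions, and the dual gating mechanisms are not modeled.

```python
import math


def select_action(first_actions, scores):
    """Planner step: return the first action of the top-ranked imagined trajectory."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return first_actions[best]


def head_weights(clean_scores, temperature=1.0):
    """Hypothetical weighting: softmax over clean scores, so the few top-ranked
    trajectories carry most of the weight and the long tail carries little."""
    top = max(clean_scores)
    exps = [math.exp((s - top) / temperature) for s in clean_scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_ranking_loss(triggered_scores, clean_scores, target, margin=1.0):
    """Weighted pairwise hinge loss (an assumption, not the paper's formula).

    Zero once the target trajectory outscores every competitor by `margin`
    under the trigger; competitors ranked highly on clean input weigh most.
    """
    weights = head_weights(clean_scores)
    loss = 0.0
    for i, score in enumerate(triggered_scores):
        if i == target:
            continue
        gap = triggered_scores[target] - score
        loss += weights[i] * max(0.0, margin - gap)
    return loss


# Toy data (made up for illustration): four imagined trajectories.
first_actions = ["left", "right", "stay", "back"]
clean_scores = [5.0, 1.0, 0.5, 0.2]       # long-tailed: one trajectory dominates
triggered_scores = [2.0, 4.0, 0.5, 0.2]   # ordering of the top two is swapped

print(select_action(first_actions, clean_scores))      # left
print(select_action(first_actions, triggered_scores))  # right
print(weighted_ranking_loss(triggered_scores, clean_scores, target=1))
```

In this toy setting, swapping the order of only the top two trajectories is enough to change the selected action, while the low-ranked trajectories are left untouched. That mirrors the abstract's point that a few decision-critical trajectories govern the planning outcome.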
Problem

Research questions and friction points this paper is trying to address.

world models
backdoor attack
trajectory ranking
long-horizon planning
security vulnerability
Innovation

Methods, ideas, or system contributions that make the work stand out.

world models
backdoor attack
trajectory ranking
tail-aware optimization
planning hijacking