🤖 AI Summary
This work addresses the insufficient robustness of Vision-Language-Action (VLA) models to subtle linguistic perturbations, revealing previously under-explored safety risks. The authors propose the Diversity-Aware Embodied Red-Teaming (DAERT) framework, which introduces diversity constraints into red-teaming for the first time. Leveraging reinforcement learning to optimize a uniform policy, DAERT generates diverse yet effective adversarial instructions and measures their impact through execution failures in physical simulation environments. This approach mitigates the mode collapse common in conventional red-teaming methods, substantially improving test coverage and attack efficacy. Empirical results across multiple robotic benchmarks show that DAERT reduces the average task success rate of VLA models from 93.33% to 5.85%, exposing critical language-based safety vulnerabilities.
📝 Abstract
Vision-Language-Action (VLA) models have achieved remarkable success in robotic manipulation. However, their robustness to linguistic nuances remains critically under-explored, posing a significant safety risk to real-world deployment. Red teaming, i.e., identifying environmental scenarios that elicit catastrophic behaviors, is an important step in ensuring the safe deployment of embodied AI agents. Reinforcement learning (RL) has emerged as a promising approach to automated red teaming that aims to uncover such vulnerabilities. However, standard RL-based adversaries often suffer from severe mode collapse: their reward-maximizing nature tends to converge to a narrow set of trivial or repetitive failure patterns, failing to reveal the full landscape of meaningful risks. To bridge this gap, we propose a novel **D**iversity-**A**ware **E**mbodied **R**ed **T**eaming (**DAERT**) framework to expose the vulnerabilities of VLAs to linguistic variations. Our design is based on optimizing a uniform policy that generates a diverse set of challenging instructions while maintaining attack effectiveness, measured by execution failures in a physical simulator. We conduct extensive experiments across different robotic benchmarks against two state-of-the-art VLAs, π₀ and OpenVLA. Our method consistently discovers a wider range of more effective adversarial instructions, reducing the average task success rate from 93.33% to 5.85% and demonstrating a scalable approach to stress-testing VLA agents and exposing critical safety blind spots before real-world deployment.
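To make the diversity-aware idea concrete, here is a minimal sketch of a reward that combines attack effectiveness with a novelty bonus. This is an illustrative assumption, not the paper's actual implementation: the function names, the token-level Jaccard similarity measure, and the weight `lam` are all hypothetical stand-ins for whatever diversity constraint DAERT uses in its RL objective.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two instructions (illustrative)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def diversity_aware_reward(instruction: str, attack_success: float,
                           history: list[str], lam: float = 0.5) -> float:
    """Reward = attack effectiveness (e.g. execution-failure rate in the
    simulator) minus a penalty for similarity to previously found attacks.
    The penalty discourages the mode collapse of a purely reward-maximizing
    adversary that keeps resampling one successful instruction."""
    max_sim = max((jaccard_similarity(instruction, h) for h in history),
                  default=0.0)
    return attack_success - lam * max_sim

# A repeated attack is scored lower than an equally effective novel one.
history = ["pick up the red block"]
r_repeat = diversity_aware_reward("pick up the red block", 1.0, history)
r_novel = diversity_aware_reward("slide the blue cup left", 1.0, history)
assert r_novel > r_repeat
```

Under this kind of objective, the adversary is pushed toward covering many distinct failure-inducing instructions rather than exploiting a single one, which is the mode-collapse mitigation the abstract describes.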