How VLAs (Really) Work In Open-World Environments

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language-action (VLA) models are evaluated in open-world settings primarily on task completion, which often overlooks operational safety and procedural robustness and can lead to an overestimation of their performance. Working on the BEHAVIOR1K (B1K) Challenge, this work introduces a multidimensional evaluation protocol that covers reproducibility, consistency, safety, and task awareness. By combining policy replay, safety violation detection, and task-progress-aware analysis, the proposed framework enables a systematic assessment of VLA behaviors. Experimental results reveal significant deficiencies in mainstream models with respect to safety and consistency. The new protocol uncovers latent risks that could compromise real-world deployment, providing a more comprehensive benchmark for future VLA development.

📝 Abstract

Vision-language-action models (VLAs) have been used extensively in robotics applications, achieving great success on various manipulation problems. More recently, VLAs have been applied to long-horizon tasks and evaluated on benchmarks such as BEHAVIOR1K (B1K), which involve solving complex household chores. The common metric for measuring progress on such benchmarks is success rate, or a partial score based on the satisfaction of progress-agnostic criteria: only the final states of the objects are considered, regardless of the events that led to those states. In this paper, we argue that such evaluation protocols say little about the safety of operation and can exaggerate reported performance, obscuring core challenges for future real-world deployment. To this end, we conduct a thorough analysis of state-of-the-art models on the B1K Challenge and evaluate policies in terms of robustness (via reproducibility and consistency of performance), the safety of policy operation, task awareness, and the key factors that lead to incomplete tasks. We then propose evaluation protocols that capture safety violations, to better measure the true performance of policies in more complex and interactive scenarios. Finally, we discuss the limitations of existing VLAs and motivate future research.
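
To make the distinction between progress-agnostic and safety-aware scoring concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's protocol: the `Step` record, the violation labels, the goal predicates, and the per-violation `penalty` are all hypothetical, and the subtraction rule is just one possible way to fold safety events into a score.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, str]  # hypothetical: object name -> symbolic state


@dataclass
class Step:
    state: State
    violations: List[str]  # hypothetical safety events flagged at this step


def final_state_score(trajectory: List[Step],
                      goals: List[Callable[[State], bool]]) -> float:
    """Progress-agnostic partial score: fraction of goals true at the last step."""
    final = trajectory[-1].state
    return sum(goal(final) for goal in goals) / len(goals)


def safety_violations(trajectory: List[Step]) -> List[str]:
    """Collect every safety event flagged anywhere along the trajectory."""
    return [v for step in trajectory for v in step.violations]


def safety_aware_score(trajectory: List[Step],
                       goals: List[Callable[[State], bool]],
                       penalty: float = 0.25) -> float:
    """Final-state score reduced per violation (penalty value is an assumption)."""
    base = final_state_score(trajectory, goals)
    return max(0.0, base - penalty * len(safety_violations(trajectory)))


goals = [lambda s: s["cup"] == "on_table", lambda s: s["door"] == "closed"]
trajectory = [
    Step({"cup": "in_cabinet", "door": "open"}, []),
    Step({"cup": "in_hand", "door": "open"}, ["collision_with_shelf"]),
    Step({"cup": "on_table", "door": "closed"}, []),
]

print(final_state_score(trajectory, goals))   # 1.0
print(safety_aware_score(trajectory, goals))  # 0.75
```

In this toy rollout the final state satisfies every goal, so the progress-agnostic score is perfect even though a collision occurred along the way; only the score that inspects the whole trajectory reflects it.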

Problem

Research questions and friction points this paper is trying to address.

Vision-language-action models

open-world environments

safety evaluation

task robustness

performance metrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language-action models

safety evaluation

robustness