🤖 AI Summary
This work addresses the fragmented interfaces and communication latency that commonly hinder the deployment of Vision-Language-Action (VLA) models in robotic systems. To overcome these limitations, the authors propose a modular policy server that encapsulates VLA inference behind a unified Gymnasium-style interface and introduce a context-aware communication mechanism that adaptively switches between two modes: zero-copy shared memory for local execution, accelerating simulation, and compressed streaming for remote operation, reducing bandwidth overhead. The design supports seven mainstream policies, including OpenVLA and Pi Zero, and consistently outperforms the default servers of OpenVLA, OpenPi, and LeRobot in both local and remote benchmarks, improving the deployment efficiency and generality of VLA systems.
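The context-aware switch described above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the actual VLAgents API: the class and method names are assumptions, the "zero-copy" path is stubbed as a by-reference handoff, and the remote path simply pickles and zlib-compresses the observation.

```python
import pickle
import zlib


class SharedMemoryTransport:
    """Local path: hand the observation over by reference (zero-copy stub)."""

    def send(self, obs):
        return obs  # no serialization; a real server would map the same buffer


class CompressedStreamTransport:
    """Remote path: serialize and compress to reduce bandwidth."""

    def send(self, obs):
        return zlib.compress(pickle.dumps(obs))


class PolicyClient:
    """Gymnasium-style client: reset()/step() against a policy server,
    picking the transport from the deployment context (hypothetical names)."""

    def __init__(self, host="127.0.0.1"):
        local = host in ("127.0.0.1", "localhost")
        self.transport = (
            SharedMemoryTransport() if local else CompressedStreamTransport()
        )

    def reset(self, obs):
        # Send the initial observation through whichever transport was chosen.
        return self.transport.send(obs)

    def step(self, obs):
        # One inference round trip: observation in, action out (stubbed action).
        payload = self.transport.send(obs)
        sent_bytes = len(payload) if isinstance(payload, (bytes, bytearray)) else 0
        return {"action": [0.0] * 7, "payload_bytes": sent_bytes}
```

The point of the sketch is the dispatch: the caller always sees the same reset/step surface, while the cost of moving observations changes with the context, which is the behavior the benchmark in the paper measures.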
📝 Abstract
The rapid emergence of Vision-Language-Action models (VLAs) has had a significant impact on robotics. However, their deployment remains complex due to fragmented interfaces and the inherent communication latency of distributed setups. To address this, we introduce VLAgents, a modular policy server that abstracts VLA inference behind a unified Gymnasium-style protocol. Crucially, its communication layer adapts transparently to the deployment context, supporting both zero-copy shared memory for high-speed simulation and compressed streaming for remote hardware. In this work, we present the architecture of VLAgents and validate it by integrating seven policies -- including OpenVLA and Pi Zero. In benchmarks covering both local and remote communication, we further demonstrate that it outperforms the default policy servers provided by OpenVLA, OpenPi, and LeRobot. VLAgents is available at https://github.com/RobotControlStack/vlagents