🤖 AI Summary
Vision-language-action (VLA) models face a generalization bottleneck in cross-view deployment due to misalignment between the perception space (camera frame) and the action space (robot base frame). This work proposes an observation-centric VLA framework that, for the first time, directly predicts the end-effector pose in the camera observation space, explicitly aligning the perception and action coordinate systems via the camera extrinsic matrix. The method requires no architectural modifications, enabling plug-and-play integration, and significantly improves robustness to camera pose variations. Evaluated both in simulation and on real robotic platforms, the approach accelerates training convergence, increases task success rates, and substantially enhances cross-view policy transferability.
📝 Abstract
Vision-Language-Action (VLA) models frequently encounter challenges in generalizing to real-world environments due to inherent discrepancies between observation and action spaces. Although training data are collected from diverse camera perspectives, the models typically predict end-effector poses within the robot base coordinate frame, resulting in spatial inconsistencies. To mitigate this limitation, we introduce the Observation-Centric VLA (OC-VLA) framework, which grounds action predictions directly in the camera observation space. Leveraging the camera's extrinsic calibration matrix, OC-VLA transforms end-effector poses from the robot base coordinate system into the camera coordinate system, thereby unifying prediction targets across heterogeneous viewpoints. This lightweight, plug-and-play strategy ensures robust alignment between perception and action, substantially improving model resilience to camera viewpoint variations. The proposed approach is readily compatible with existing VLA architectures, requiring no substantial modifications. Comprehensive evaluations on both simulated and real-world robotic manipulation tasks demonstrate that OC-VLA accelerates convergence, enhances task success rates, and improves cross-view generalization. The code will be publicly available.
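The core re-targeting step described above is a single rigid-body change of coordinates. A minimal sketch of the idea in NumPy (the frame names, the specific extrinsic values, and the helper function are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def pose_to_matrix(position, rotation):
    """Build a 4x4 homogeneous transform from a position and a 3x3 rotation."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = position
    return T

# Hypothetical extrinsic calibration: T_cam_base maps coordinates expressed
# in the robot base frame into the camera frame.
R_cam_base = np.array([[0., -1.,  0.],
                       [0.,  0., -1.],
                       [1.,  0.,  0.]])
T_cam_base = pose_to_matrix(np.array([0.1, 0.2, 1.5]), R_cam_base)

# End-effector pose in the robot base frame (the conventional VLA target).
T_base_ee = pose_to_matrix(np.array([0.5, 0.0, 0.3]), np.eye(3))

# Observation-centric target: the same pose re-expressed in camera
# coordinates, so the prediction target is consistent with what the
# camera actually observes, regardless of where the camera is mounted.
T_cam_ee = T_cam_base @ T_base_ee
```

Because each training viewpoint has its own extrinsic matrix, applying this transform per demonstration unifies the prediction target across heterogeneous camera placements without touching the model architecture.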