🤖 AI Summary
Vision-language-action (VLA) models face a generalization bottleneck in cross-view deployment due to misalignment between the perception space (camera frame) and the action space (robot base frame). This work proposes an observation-centric VLA framework that, for the first time, directly predicts the end-effector pose in the camera observation space, explicitly aligning the perception and action coordinate systems via the camera extrinsic matrix. The method requires no architectural modifications, enabling plug-and-play integration, and significantly improves robustness to camera pose variations. Evaluated both in simulation and on real robotic platforms, the approach accelerates training convergence, increases task success rates, and substantially enhances cross-view policy transferability.
📝 Abstract
Vision-Language-Action (VLA) models frequently encounter challenges in generalizing to real-world environments due to inherent discrepancies between observation and action spaces. Although training data are collected from diverse camera perspectives, the models typically predict end-effector poses within the robot base coordinate frame, resulting in spatial inconsistencies. To mitigate this limitation, we introduce the Observation-Centric VLA (OC-VLA) framework, which grounds action predictions directly in the camera observation space. Leveraging the camera's extrinsic calibration matrix, OC-VLA transforms end-effector poses from the robot base coordinate system into the camera coordinate system, thereby unifying prediction targets across heterogeneous viewpoints. This lightweight, plug-and-play strategy ensures robust alignment between perception and action, substantially improving model resilience to camera viewpoint variations. The proposed approach is readily compatible with existing VLA architectures, requiring no substantial modifications. Comprehensive evaluations on both simulated and real-world robotic manipulation tasks demonstrate that OC-VLA accelerates convergence, enhances task success rates, and improves cross-view generalization. The code will be publicly available.
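The core re-targeting step described above is a single rigid-body change of coordinates. A minimal sketch of the idea in NumPy (the frame names, the specific extrinsic values, and the helper function are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def pose_to_matrix(position, rotation):
    """Build a 4x4 homogeneous transform from a position and a 3x3 rotation."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = position
    return T

# Hypothetical extrinsic calibration: T_cam_base maps coordinates expressed
# in the robot base frame into the camera frame.
R_cam_base = np.array([[0., -1.,  0.],
                       [0.,  0., -1.],
                       [1.,  0.,  0.]])
T_cam_base = pose_to_matrix(np.array([0.1, 0.2, 1.5]), R_cam_base)

# End-effector pose in the robot base frame (the conventional VLA target).
T_base_ee = pose_to_matrix(np.array([0.5, 0.0, 0.3]), np.eye(3))

# Observation-centric target: the same pose re-expressed in camera
# coordinates, so the prediction target is consistent with what the
# camera actually observes, regardless of where the camera is mounted.
T_cam_ee = T_cam_base @ T_base_ee
```

Because each training viewpoint has its own extrinsic matrix, applying this transform per demonstration unifies the prediction target across heterogeneous camera placements without touching the model architecture.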