🤖 AI Summary
Existing Vision-Language-Action (VLA) models suffer from high training costs and produce low-level control signals with limited generalizability. This paper proposes a lightweight vision-language-action framework that directly predicts end-effector trajectory waypoints in the camera coordinate frame, eliminating dependence on any specific robot's kinematics. The method combines a frozen vision-language model (VLM) with depth-image inputs, inference-time decoding strategies, and demonstration-conditioned action generation, and is trained exclusively on synthetic data. Its key contributions are a camera-space trajectory prediction mechanism and a cross-domain generalization design, which together substantially improve sim-to-real transfer. Experiments demonstrate that the approach efficiently accomplishes complex manipulation tasks on both simulated and real robotic platforms while reducing training cost and enabling flexible deployment. The results validate its effectiveness, practicality, and improved generalization over prior VLA approaches.
📝 Abstract
Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive performance of Vision-Language Models (VLMs) on 2D images to directly infer robot end-effector poses in image-frame coordinates. Unlike prior VLA models that output low-level controls, our model predicts trajectory waypoints, making it both more efficient to train and robot-embodiment agnostic. Despite its lightweight design, our next-token prediction architecture effectively learns meaningful and executable robot trajectories. We further explore the underutilized potential of incorporating depth images, inference-time techniques such as decoding strategies, and demonstration-conditioned action generation. Our model is trained on a simulated dataset and exhibits strong sim-to-real transfer capabilities. We evaluate our approach using a combination of simulated and real data, demonstrating its effectiveness on a real robotic system.
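To make the next-token waypoint idea concrete, here is a minimal, hypothetical sketch of how camera-frame trajectory waypoints could be round-tripped through a discrete token vocabulary for autoregressive prediction. The binning scheme, coordinate ranges, and function names below are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Hypothetical binning scheme (not from the paper): each waypoint coordinate
# (u, v in pixels, depth in metres) is quantised into one of N_BINS tokens,
# so a K-waypoint trajectory becomes a flat sequence of 3K tokens that a
# next-token model can emit and a decoder can map back to poses.
N_BINS = 256
U_RANGE = (0.0, 640.0)   # image width in pixels (illustrative)
V_RANGE = (0.0, 480.0)   # image height in pixels (illustrative)
D_RANGE = (0.2, 1.5)     # workspace depth range in metres (illustrative)

_LO = np.array([U_RANGE[0], V_RANGE[0], D_RANGE[0]])
_HI = np.array([U_RANGE[1], V_RANGE[1], D_RANGE[1]])

def encode(waypoints):
    """Map (K, 3) camera-frame waypoints to a flat token sequence of length 3K."""
    norm = (np.asarray(waypoints) - _LO) / (_HI - _LO)       # scale to [0, 1]
    tokens = np.clip((norm * N_BINS).astype(int), 0, N_BINS - 1)
    return tokens.ravel()

def decode(tokens):
    """Map a flat token sequence back to (K, 3) waypoints at bin centres."""
    norm = (np.asarray(tokens).reshape(-1, 3) + 0.5) / N_BINS
    return _LO + norm * (_HI - _LO)

wps = [[320.0, 240.0, 0.8], [400.0, 200.0, 0.6]]
recon = decode(encode(wps))
print(np.abs(recon - np.array(wps)).max())  # quantisation error only
```

Because the tokens live in the camera frame, the same sequence can be mapped to joint commands for any arm via that robot's own inverse kinematics, which is one way a waypoint-level output stays embodiment agnostic.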