🤖 AI Summary
Existing Vision-Language-Action (VLA) models suffer from high training costs and produce low-level control signals with limited generalizability. This paper proposes a lightweight vision-language-action framework that directly predicts end-effector trajectory waypoints in the camera coordinate frame, eliminating dependence on any specific robot's kinematics. The method combines a frozen vision-language model (VLM) with depth-image inputs, inference-time decoding strategies, and demonstration-conditioned action generation, and is trained exclusively on synthetic data. Its key contributions are a camera-space trajectory prediction mechanism and a cross-domain generalization design, which together substantially improve sim-to-real transfer. Experiments demonstrate that the approach efficiently accomplishes complex manipulation tasks on both simulated and real robotic platforms while reducing training cost and enabling flexible deployment. The results validate its effectiveness, practicality, and improved generalization over prior VLA approaches.
📝 Abstract
Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive performance of Vision-Language Models (VLMs) on 2D images to directly infer robot end-effector poses in image-frame coordinates. Unlike prior VLA models that output low-level controls, our model predicts trajectory waypoints, making it both more efficient to train and robot-embodiment agnostic. Despite its lightweight design, our next-token prediction architecture effectively learns meaningful and executable robot trajectories. We further explore the underutilized potential of incorporating depth images, inference-time techniques such as decoding strategies, and demonstration-conditioned action generation. Our model is trained on a simulated dataset and exhibits strong sim-to-real transfer capabilities. We evaluate our approach using a combination of simulated and real data, demonstrating its effectiveness on a real robotic system.
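To make the next-token waypoint idea concrete, here is a minimal, hypothetical sketch of how camera-frame trajectory waypoints could be round-tripped through a discrete token vocabulary for autoregressive prediction. The binning scheme, coordinate ranges, and function names below are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Hypothetical binning scheme (not from the paper): each waypoint coordinate
# (u, v in pixels, depth in metres) is quantised into one of N_BINS tokens,
# so a K-waypoint trajectory becomes a flat sequence of 3K tokens that a
# next-token model can emit and a decoder can map back to poses.
N_BINS = 256
U_RANGE = (0.0, 640.0)   # image width in pixels (illustrative)
V_RANGE = (0.0, 480.0)   # image height in pixels (illustrative)
D_RANGE = (0.2, 1.5)     # workspace depth range in metres (illustrative)

_LO = np.array([U_RANGE[0], V_RANGE[0], D_RANGE[0]])
_HI = np.array([U_RANGE[1], V_RANGE[1], D_RANGE[1]])

def encode(waypoints):
    """Map (K, 3) camera-frame waypoints to a flat token sequence of length 3K."""
    norm = (np.asarray(waypoints) - _LO) / (_HI - _LO)       # scale to [0, 1]
    tokens = np.clip((norm * N_BINS).astype(int), 0, N_BINS - 1)
    return tokens.ravel()

def decode(tokens):
    """Map a flat token sequence back to (K, 3) waypoints at bin centres."""
    norm = (np.asarray(tokens).reshape(-1, 3) + 0.5) / N_BINS
    return _LO + norm * (_HI - _LO)

wps = [[320.0, 240.0, 0.8], [400.0, 200.0, 0.6]]
recon = decode(encode(wps))
print(np.abs(recon - np.array(wps)).max())  # quantisation error only
```

Because the tokens live in the camera frame, the same sequence can be mapped to joint commands for any arm via that robot's own inverse kinematics, which is one way a waypoint-level output stays embodiment agnostic.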