OG-VLA: 3D-Aware Vision Language Action Model via Orthographic Image Generation

📅 2025-06-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Poor generalization of Vision-Language-Action (VLA) models, particularly their sensitivity to camera and robot pose and the limited transferability of 3D-aware policies to unseen instructions and objects, hampers real-world deployment. To address this, we propose a robust framework that maps multi-view RGB-D observations and language instructions to quasi-static actions. Our core innovation is an orthographic image generation mechanism: leveraging point cloud reconstruction and canonical orthographic rendering, it transforms arbitrary-view RGB-D inputs into pose-invariant canonical views, effectively bridging 3D geometry with LLM-driven action grounding. The method integrates a ViT encoder, a large language model, and an image diffusion prior. On the Arnold and Colosseum benchmarks, it achieves a relative improvement of over 40% in zero-shot generalization to unseen environments, while preserving robustness in seen scenarios. Physical deployment requires only 3–5 demonstrations for rapid adaptation.

📝 Abstract
We introduce OG-VLA, a novel architecture and learning framework that combines the generalization strengths of Vision Language Action models (VLAs) with the robustness of 3D-aware policies. We address the challenge of mapping natural language instructions and multi-view RGBD observations to quasi-static robot actions. 3D-aware robot policies achieve state-of-the-art performance on precise robot manipulation tasks, but struggle with generalization to unseen instructions, scenes, and objects. On the other hand, VLAs excel at generalizing across instructions and scenes, but can be sensitive to camera and robot pose variations. We leverage prior knowledge embedded in language and vision foundation models to improve generalization of 3D-aware keyframe policies. OG-VLA projects input observations from diverse views into a point cloud which is then rendered from canonical orthographic views, ensuring input view invariance and consistency between input and output spaces. These canonical views are processed with a vision backbone, a Large Language Model (LLM), and an image diffusion model to generate images that encode the next position and orientation of the end-effector on the input scene. Evaluations on the Arnold and Colosseum benchmarks demonstrate state-of-the-art generalization to unseen environments, with over 40% relative improvements while maintaining robust performance in seen settings. We also show real-world adaptation with 3 to 5 demonstrations along with strong generalization. Videos and resources at https://og-vla.github.io/
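The first step of the pipeline described above, lifting multi-view RGB-D observations into a shared point cloud, can be sketched as follows. This is a minimal NumPy illustration under a standard pinhole-camera model; the function name and argument conventions are ours, not taken from the OG-VLA code.

```python
import numpy as np

def backproject_rgbd(depth, K, T_world_cam):
    """Lift a metric depth image into a world-frame point cloud.

    depth: (H, W) depth in meters; K: (3, 3) pinhole intrinsics;
    T_world_cam: (4, 4) camera-to-world pose. Returns (N, 3) points.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.ravel()
    valid = z > 0  # skip pixels with missing depth
    # pixel coordinates -> camera-frame 3D points, scaled by depth
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)[:, valid]
    # homogeneous transform into the shared world frame
    pts_world = T_world_cam @ pts_cam
    return pts_world[:3].T
```

Running this per camera and concatenating the results yields the fused cloud that is then re-rendered from canonical orthographic views.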
Problem

Research questions and friction points this paper is trying to address.

Mapping language instructions to robot actions robustly
Improving 3D-aware policies' generalization to unseen scenarios
Ensuring input view invariance for vision-language-action models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Orthographic image generation for view invariance
Combines VLA with 3D-aware keyframe policies
Uses LLM and diffusion model for end-effector prediction
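To make the orthographic-image-generation idea in the bullets above concrete, here is a rough NumPy sketch of rendering a colored point cloud into one canonical orthographic view (a top-down projection along the z-axis with a z-buffer). Resolution, workspace bounds, and the choice of axis are illustrative assumptions, not details from the paper.

```python
import numpy as np

def ortho_render(points, colors, res=64, bounds=(-0.5, 0.5)):
    """Render a colored point cloud as a top-down orthographic image.

    points: (N, 3) world coordinates; colors: (N, 3) per-point RGB.
    bounds: workspace extent in x and y; a z-buffer keeps the highest
    point falling into each pixel (the one nearest a top-down camera).
    """
    lo, hi = bounds
    img = np.zeros((res, res, 3), dtype=colors.dtype)
    zbuf = np.full((res, res), -np.inf)
    # world (x, y) -> pixel indices on the canonical grid
    px = ((points[:, 0] - lo) / (hi - lo) * res).astype(int)
    py = ((points[:, 1] - lo) / (hi - lo) * res).astype(int)
    inside = (px >= 0) & (px < res) & (py >= 0) & (py < res)
    for x, y, z, c in zip(px[inside], py[inside],
                          points[inside, 2], colors[inside]):
        if z > zbuf[y, x]:  # higher point occludes lower ones
            zbuf[y, x] = z
            img[y, x] = c
    return img
```

Because the grid is fixed in the world frame, the rendered view is identical regardless of which camera poses produced the input cloud, which is the view-invariance property the method relies on.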