🤖 AI Summary
A critical domain shift separates high-level reasoning in vision-language models (VLMs) from downstream vision-language-action (VLA) policy learning. Method: We propose Vlaser, a unified framework that integrates supervised VLA fine-tuning, spatial reasoning, embodied question answering, and task planning atop a VLM architecture, enabling joint modeling of multimodal perception and action policies. We further introduce the high-quality Vlaser-6M dataset and use it to conduct the first systematic study of how VLM initialization affects VLA policy learning. Results: Vlaser achieves state-of-the-art performance on the WidowX benchmark, competitive performance on the Google Robot benchmark, and significant gains across diverse embodied reasoning tasks. It effectively bridges the gap between upstream perceptual-reasoning capabilities and downstream robotic control, establishing a principled pathway from VLMs to embodied agents.
📝 Abstract
While significant research has focused on developing embodied reasoning capabilities with Vision-Language Models (VLMs) or on integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing Vlaser, a Vision-Language-Action model with synergistic embodied reasoning capability: a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks, including spatial reasoning, embodied grounding, embodied QA, and task planning. Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodiment-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.