🤖 AI Summary
This work addresses the limitations of existing vision–language–action (VLA) models for robotic manipulation, which are often inefficient and lack high-level semantic understanding and spatial awareness. The authors propose PokeVLA, a lightweight foundation model for embodied manipulation trained in two stages: first, a compact vision–language model (PokeVLM) is pretrained on a curated multimodal dataset of 2.4 million samples covering spatial grounding, affordance, and embodied reasoning; second, manipulation-relevant representations are injected into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. Evaluated on the LIBERO-Plus benchmark and in real-world deployment, the model is reported to achieve state-of-the-art performance, outperforming comparable baselines in success rate and in robustness under diverse perturbations.
📝 Abstract
Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods suffer from limited efficiency and lack high-level knowledge and spatial awareness. To address these challenges, we propose PokeVLA, a lightweight yet powerful foundation model for embodied manipulation that effectively infuses vision-language understanding into action learning. Our framework introduces a two-stage training paradigm: first, we pre-train a compact vision-language model (PokeVLM) on a curated multimodal dataset of 2.4M samples encompassing spatial grounding, affordance, and embodied reasoning tasks; second, we inject manipulation-relevant representations into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. Extensive experiments demonstrate state-of-the-art performance on the LIBERO-Plus benchmark and in real-world deployment, outperforming comparable baselines in success rate and robustness under diverse perturbations. To foster reproducibility and community progress, we will open-source our code, model weights, and the scripts for the curated pre-training dataset. Project page: https://getterupper.github.io/PokeVLA