PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

📅 2026-04-22

🤖 AI Summary
This work addresses the limitations of existing vision–language–action models in robotic manipulation, which often suffer from insufficient high-level semantic understanding and spatial perception. The authors propose a lightweight embodied manipulation foundation model trained in two stages: first, a compact vision–language model is pretrained on a large-scale multimodal dataset of 2.4 million samples; second, it is integrated with multi-view object perception, geometric alignment, and a novel action expert module to effectively map semantic representations into the action space. By unifying spatial localization, functional affordances, and embodied reasoning, the approach substantially enhances decision robustness. Evaluated on the LIBERO-Plus benchmark and real-world scenarios, the model achieves state-of-the-art performance, significantly improving task success rates and resilience to perturbations.

📝 Abstract
Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods exhibit limited efficiency and a lack of high-level knowledge and spatial awareness. To address these challenges, we propose PokeVLA, a lightweight yet powerful foundation model for embodied manipulation that effectively infuses vision-language understanding into action learning. Our framework introduces a two-stage training paradigm: first, we pre-train a compact vision-language model (PokeVLM) on a curated multimodal dataset of 2.4M samples encompassing spatial grounding, affordance, and embodied reasoning tasks; second, we inject manipulation-relevant representations into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. Extensive experiments demonstrate state-of-the-art performance on the LIBERO-Plus benchmark and in real-world deployment, outperforming comparable baselines in success rate and robustness under diverse perturbations. To foster reproducibility and community progress, we will open-source our code, model weights, and the scripts for the curated pre-training dataset. Project page: https://getterupper.github.io/PokeVLA

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
robot manipulation
spatial awareness
world knowledge
embodied reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action
two-stage training
spatial grounding
geometry alignment
action expert