PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

📅 2026-04-22

🤖 AI Summary

This work addresses the limitations of existing vision–language–action models in robotic manipulation, which often suffer from insufficient high-level semantic understanding and spatial perception. The authors propose a lightweight embodied manipulation foundation model trained in two stages: first, a compact vision–language model is pretrained on a large-scale multimodal dataset of 2.4 million samples; second, it is integrated with multi-view object perception, geometric alignment, and a novel action expert module to effectively map semantic representations into the action space. By unifying spatial localization, functional affordances, and embodied reasoning, the approach substantially enhances decision robustness. Evaluated on the LIBERO-Plus benchmark and real-world scenarios, the model achieves state-of-the-art performance, significantly improving task success rates and resilience to perturbations.

📝 Abstract

Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods exhibit limited efficiency and a lack of high-level knowledge and spatial awareness. To address these challenges, we propose PokeVLA, a lightweight yet powerful foundation model for embodied manipulation that effectively infuses vision-language understanding into action learning. Our framework introduces a two-stage training paradigm: first, we pre-train a compact vision-language model (PokeVLM) on a curated multimodal dataset of 2.4M samples encompassing spatial grounding, affordance, and embodied reasoning tasks; second, we inject manipulation-relevant representations into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. Extensive experiments demonstrate state-of-the-art performance on the LIBERO-Plus benchmark and in real-world deployment, outperforming comparable baselines in success rate and robustness under diverse perturbations. To foster reproducibility and community progress, we will open-source our code, model weights, and the scripts for the curated pre-training dataset. Project page: https://getterupper.github.io/PokeVLA
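
The two-stage paradigm described in the abstract can be written down as a small configuration sketch. This is only an illustration of the stage ordering and the ingredients the abstract names. The `Stage` class, its field names, the stage names, and the helper functions are hypothetical and are not taken from the PokeVLA code release. Only the dataset size (2.4M samples), the model names, and the listed tasks and components come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Stage:
    """One training stage. All field names are illustrative, not from the paper."""
    name: str                          # hypothetical label for the stage
    produces: str                      # model obtained at the end of the stage
    components: Tuple[str, ...]        # tasks or modules named in the abstract
    num_samples: Optional[int] = None  # None where the abstract gives no figure


def build_schedule() -> List[Stage]:
    """Return the two stages in the order the abstract describes them."""
    # Stage 1: pre-train the compact VLM (PokeVLM) on the curated multimodal dataset.
    stage1 = Stage(
        name="vlm_pretraining",
        produces="PokeVLM",
        components=("spatial grounding", "affordance", "embodied reasoning"),
        num_samples=2_400_000,
    )
    # Stage 2: inject manipulation-relevant representations into the action space.
    stage2 = Stage(
        name="action_learning",
        produces="PokeVLA",
        components=(
            "multi-view goal-aware semantics learning",
            "geometry alignment",
            "action expert",
        ),
    )
    return [stage1, stage2]


def describe(schedule: List[Stage]) -> List[str]:
    """Format each stage as a one-line, human-readable summary."""
    lines = []
    for index, stage in enumerate(schedule, start=1):
        if stage.num_samples is not None:
            size = f"{stage.num_samples:,} samples"
        else:
            size = "size not stated"
        parts = ", ".join(stage.components)
        lines.append(f"Stage {index} ({stage.name}) -> {stage.produces}: {parts} [{size}]")
    return lines


if __name__ == "__main__":
    for line in describe(build_schedule()):
        print(line)
```

The sketch deliberately leaves out anything the abstract does not specify, such as optimizer settings, which modules are frozen in each stage, or the size of the stage-two data.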

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

robot manipulation

spatial awareness

world knowledge

embodied reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action

two-stage training

spatial grounding