PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model

📅 2025-11-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language-action (VLA) models have limited pixel-level scene understanding and rely heavily on textual prompts, which constrains their flexibility in real-world settings. To address this, we propose PixelVLA, the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. The method combines a multiscale pixel-aware encoder with a visual prompting encoder inside a visuomotor instruction tuning framework, and uses a two-stage automated annotation pipeline to build Pixel-160K, a large-scale dataset with pixel-level annotations derived from existing robot data. On three standard VLA benchmarks and two VLA model variants, PixelVLA improves manipulation success rates by 10.1%-17.8% over OpenVLA while requiring only 1.5% of OpenVLA’s pretraining cost. These results suggest the approach can be integrated into existing VLAs for more accurate and efficient robot control.

📝 Abstract

Vision-Language-Action models (VLAs) are emerging as powerful tools for learning generalizable visuomotor control policies. However, current VLAs are mostly trained on large-scale image-text-action data and remain limited in two key ways: (i) they struggle with pixel-level scene understanding, and (ii) they rely heavily on textual prompts, which reduces their flexibility in real-world settings. To address these challenges, we introduce PixelVLA, the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. Our approach is built on a new visuomotor instruction tuning framework that integrates a multiscale pixel-aware encoder with a visual prompting encoder. To train PixelVLA effectively, we further propose a two-stage automated annotation pipeline that generates Pixel-160K, a large-scale dataset with pixel-level annotations derived from existing robot data. Experiments on three standard VLA benchmarks and two VLA model variants show that PixelVLA improves manipulation success rates by 10.1%-17.8% over OpenVLA, while requiring only 1.5% of its pretraining cost. These results demonstrate that PixelVLA can be integrated into existing VLAs to enable more accurate, efficient, and versatile robot control in complex environments. The dataset and code will be released as open source.
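To make the idea of a visual prompt read at several scales more concrete, here is a minimal sketch in plain Python. It is not PixelVLA's encoder, which the abstract does not specify. It assumes a box prompt is rasterized into a binary mask and that a single-channel feature map is summarized by mask-weighted average pooling at a few resolutions. All function names, the box convention, and the pooling factors are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]


def box_to_mask(height: int, width: int, box: Tuple[int, int, int, int]) -> Grid:
    """Rasterize a box prompt (x0, y0, x1, y1), end-exclusive, into a binary mask."""
    x0, y0, x1, y1 = box
    return [
        [1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0 for x in range(width)]
        for y in range(height)
    ]


def downsample(grid: Grid, factor: int) -> Grid:
    """Average-pool a grid by an integer factor (height and width must divide evenly)."""
    height, width = len(grid), len(grid[0])
    pooled = []
    for y in range(0, height, factor):
        row = []
        for x in range(0, width, factor):
            block = [
                grid[yy][xx]
                for yy in range(y, y + factor)
                for xx in range(x, x + factor)
            ]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def masked_average(feature: Grid, mask: Grid) -> float:
    """Mask-weighted mean of a feature map; returns 0.0 when the mask is empty."""
    total = sum(f * m for f_row, m_row in zip(feature, mask) for f, m in zip(f_row, m_row))
    weight = sum(m for m_row in mask for m in m_row)
    return total / weight if weight > 0 else 0.0


def multiscale_prompt_features(
    feature: Grid, mask: Grid, factors: Sequence[int] = (1, 2, 4)
) -> List[float]:
    """One pooled value per scale: fine scales stay local, coarse scales add context."""
    return [
        masked_average(downsample(feature, f), downsample(mask, f)) for f in factors
    ]


# Toy 4x4 "feature map" and a box prompt covering the top-left 2x2 region.
feature_map = [[float(y * 4 + x) for x in range(4)] for y in range(4)]
prompt_mask = box_to_mask(4, 4, (0, 0, 2, 2))
print(multiscale_prompt_features(feature_map, prompt_mask))  # [2.5, 2.5, 7.5]
```

In this toy setup the two finer scales describe only the prompted region, while the coarsest scale blends in the surrounding scene. A learned encoder would replace the fixed averaging with trainable layers and multi-channel features.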

Problem

Research questions and friction points this paper is trying to address.

Addresses pixel-level scene understanding limitations in VLAs

Reduces the heavy reliance on textual prompts that limits flexibility in real-world settings

Enables multimodal prompting with text and visual inputs

Innovation

Methods, ideas, or system contributions that make the work stand out.

PixelVLA integrates pixel-level reasoning with multimodal prompting

Combines a multiscale pixel-aware encoder with a visual prompting encoder in a visuomotor instruction tuning framework

Employs a two-stage automated annotation pipeline to generate the Pixel-160K dataset from existing robot data

🔎 Similar Papers

Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring