🤖 AI Summary
Vision-Language-Action (VLA) models suffer from high computational overhead due to redundant visual tokens, which severely hinders real-time robotic deployment. Existing task-agnostic pruning methods struggle to jointly preserve global semantic coherence and fine-grained spatial detail. To address this, we propose Compressor-VLA, an instruction-guided dual-path visual token compression framework. Its core contribution is pairing an instruction-modulated Semantic Task Compressor, which distills task-relevant semantic context, with a Spatial Refinement Compressor, which preserves critical spatial structure, enabling dynamic, adaptive condensation of visual information. On the LIBERO benchmark, Compressor-VLA achieves competitive success rates while reducing FLOPs by 59% and cutting the visual token count by more than 3x, substantially improving the inference efficiency and deployment feasibility of VLA models on real-world robotic platforms.
📝 Abstract
Vision-Language-Action (VLA) models have emerged as a powerful paradigm in Embodied AI. However, the significant computational overhead of processing redundant visual tokens remains a critical bottleneck for real-time robotic deployment. While standard token pruning techniques can alleviate this, such task-agnostic methods struggle to preserve task-critical visual information. To address this challenge while preserving both holistic context and the fine-grained details needed for precise action, we propose Compressor-VLA, a novel hybrid instruction-conditioned token compression framework for efficient, task-oriented compression of visual information in VLA models. Compressor-VLA consists of two token compression modules: a Semantic Task Compressor (STC) that distills holistic, task-relevant context, and a Spatial Refinement Compressor (SRC) that preserves fine-grained spatial details. The compression is dynamically modulated by the natural language instruction, allowing adaptive condensation of task-relevant visual information. Extensive evaluations demonstrate that Compressor-VLA achieves a competitive success rate on the LIBERO benchmark while reducing FLOPs by 59% and the visual token count by over 3x compared to its baseline. Real-robot deployments on a dual-arm platform validate the model's sim-to-real transferability and practical applicability. Moreover, qualitative analyses reveal that the instruction guidance dynamically steers the model's perceptual focus toward task-relevant objects, further validating the effectiveness of our approach.
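The abstract does not give implementation details, but the dual-path idea can be illustrated with a minimal sketch: one path compresses the visual tokens into a few instruction-conditioned summary tokens via cross-attention (the STC role), while the other scores tokens against the instruction and keeps the top-K at their original positions (the SRC role). All dimensions, the additive query modulation, and the dot-product scoring below are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: 256 visual tokens of dim 64, compressed to 32 + 48 = 80.
N, D = 256, 64   # visual tokens, embedding dim
M = 32           # semantic summary tokens (STC-like path)
K = 48           # retained spatial tokens (SRC-like path)

visual = rng.standard_normal((N, D))   # visual token embeddings
instr = rng.standard_normal(D)         # pooled instruction embedding

# Semantic path: learned queries shifted by the instruction embedding
# cross-attend over all visual tokens to distill task-relevant context.
queries = rng.standard_normal((M, D)) + instr             # instruction-modulated
attn = softmax(queries @ visual.T / np.sqrt(D), axis=-1)  # (M, N)
semantic_tokens = attn @ visual                           # (M, D)

# Spatial path: score each token's relevance to the instruction and
# keep the top-K, preserving their original spatial order.
scores = visual @ instr / np.sqrt(D)                      # (N,)
keep = np.sort(np.argsort(scores)[-K:])
spatial_tokens = visual[keep]                             # (K, D)

# The action model then consumes 80 tokens instead of 256 (>3x reduction).
compressed = np.concatenate([semantic_tokens, spatial_tokens], axis=0)
print(compressed.shape)  # (80, 64)
```

With these illustrative sizes, the 256 input tokens shrink to 80, matching the paper's reported "over 3x" token reduction in spirit; the real framework learns both paths end to end.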