Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models

📅 2025-12-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the high latency and computational overhead of deploying large-scale vision-language-action (VLA) models in real time, this paper proposes TEAM-VLA, a training-free, dynamic token compression framework. TEAM-VLA introduces a two-stage "expand-merge" paradigm: first, spatially aware token expansion guided by attention hotspots; second, adaptive token merging steered by action semantics via hierarchical clustering and parameter-free fusion. This design preserves contextual integrity while ensuring action-semantic consistency, requires zero additional training, and enables plug-and-play integration. Evaluated on the LIBERO benchmark, TEAM-VLA achieves up to a 2.3× inference speedup with maintained or improved task success rates (+1.8%), advancing the practical deployment of VLA models on resource-constrained robotic systems.

๐Ÿ“ Abstract
Vision-Language-Action (VLA) models pretrained on large-scale multimodal datasets have emerged as powerful foundations for robotic perception and control. However, their massive scale, often billions of parameters, poses significant challenges for real-time deployment, as inference becomes computationally expensive and latency-sensitive in dynamic environments. To address this, we propose Token Expand-and-Merge-VLA (TEAM-VLA), a training-free token compression framework that accelerates VLA inference while preserving task performance. TEAM-VLA introduces a dynamic token expansion mechanism that identifies and samples additional informative tokens in the spatial vicinity of attention-highlighted regions, enhancing contextual completeness. These expanded tokens are then selectively merged in deeper layers under action-aware guidance, effectively reducing redundancy while maintaining semantic coherence. By coupling expansion and merging within a single feed-forward pass, TEAM-VLA achieves a balanced trade-off between efficiency and effectiveness, without any retraining or parameter updates. Extensive experiments on the LIBERO benchmark demonstrate that TEAM-VLA consistently improves inference speed while maintaining or even surpassing the task success rate of full VLA models. The code is publicly available at https://github.com/Jasper-aaa/TEAM-VLA
Problem

Research questions and friction points this paper is trying to address.

Accelerates VLA inference while preserving task performance
Reduces token redundancy without retraining or parameter updates
Improves inference speed while maintaining task success rate
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free token compression for Vision-Language-Action models
Dynamic token expansion near attention-highlighted regions
Action-aware token merging to reduce redundancy
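The expand-merge idea above can be sketched in a few lines of numpy. This is a toy illustration, not the authors' implementation: the function names, the 4-connected grid neighbourhood used for "spatial vicinity", and the greedy pairwise mean-merging that stands in for the paper's hierarchical-clustering fusion are all assumptions made here for clarity.

```python
import numpy as np

def expand_tokens(scores, grid_hw, top_k=4):
    """Keep the top-k attention-hotspot tokens plus their 4-connected
    grid neighbours (a stand-in for spatially aware expansion)."""
    h, w = grid_hw
    hot = np.argsort(scores)[-top_k:]          # indices of top-k tokens
    keep = set(int(i) for i in hot)
    for idx in hot:
        r, c = divmod(int(idx), w)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                keep.add(nr * w + nc)
    return np.array(sorted(keep))

def merge_tokens(tokens, target_n):
    """Greedily average the most cosine-similar token pair until only
    target_n tokens remain (a parameter-free fusion stand-in)."""
    toks = [t for t in tokens]
    while len(toks) > target_n:
        X = np.stack(toks)
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        sim = Xn @ Xn.T
        np.fill_diagonal(sim, -np.inf)         # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        merged = (toks[i] + toks[j]) / 2.0     # mean fusion
        toks = [t for k, t in enumerate(toks) if k not in (i, j)]
        toks.append(merged)
    return np.stack(toks)

# toy example: a 4x4 token grid with 8-dim features
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
scores = rng.random(16)                        # mock attention scores
keep = expand_tokens(scores, (4, 4), top_k=2)
compressed = merge_tokens(tokens[keep], target_n=4)
print(compressed.shape)                        # (4, 8)
```

In the actual method the attention scores come from the VLA backbone and the merging is action-aware and happens in deeper layers; this sketch only shows the token-count bookkeeping of expand-then-merge in one pass.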
Yifan Ye
College of Science and Engineering, Hamad Bin Khalifa University, Education City, Doha 24404, Qatar
Jiaqi Ma
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi 23201, UAE
Jun Cen
College of Computer Science and Technology, Zhejiang University, Hangzhou 310058, China
Zhihe Lu
HBKU ← NUS ← University of Surrey ← CASIA
Computer Vision · Transfer Learning · Few-shot Learning · Multimodal Learning · Continual Learning