Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models

📅 2025-12-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the high latency and computational overhead of deploying large-scale vision-language-action (VLA) models in real time, this paper proposes TEAM-VLA, a training-free, dynamic token compression framework. TEAM-VLA introduces a two-stage "expand-merge" paradigm: first, spatially aware token expansion guided by attention hotspots; second, adaptive token merging steered by action semantics via hierarchical clustering and parameter-free fusion. This design preserves contextual integrity while ensuring action-semantic consistency, requires zero additional training, and enables plug-and-play integration. Evaluated on the LIBERO benchmark, TEAM-VLA achieves up to a 2.3× inference speedup with maintained or improved task success rates (+1.8%), advancing the practical deployment of VLA models on resource-constrained robotic systems.

๐Ÿ“ Abstract
Vision-Language-Action (VLA) models pretrained on large-scale multimodal datasets have emerged as powerful foundations for robotic perception and control. However, their massive scale, often billions of parameters, poses significant challenges for real-time deployment, as inference becomes computationally expensive and latency-sensitive in dynamic environments. To address this, we propose Token Expand-and-Merge-VLA (TEAM-VLA), a training-free token compression framework that accelerates VLA inference while preserving task performance. TEAM-VLA introduces a dynamic token expansion mechanism that identifies and samples additional informative tokens in the spatial vicinity of attention-highlighted regions, enhancing contextual completeness. These expanded tokens are then selectively merged in deeper layers under action-aware guidance, effectively reducing redundancy while maintaining semantic coherence. By coupling expansion and merging within a single feed-forward pass, TEAM-VLA achieves a balanced trade-off between efficiency and effectiveness, without any retraining or parameter updates. Extensive experiments on the LIBERO benchmark demonstrate that TEAM-VLA consistently improves inference speed while maintaining or even surpassing the task success rate of full VLA models. The code is publicly available at https://github.com/Jasper-aaa/TEAM-VLA
Problem

Research questions and friction points this paper is trying to address.

Accelerates VLA inference while preserving task performance
Reduces token redundancy without retraining or parameter updates
Improves inference speed while maintaining task success rate
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free token compression for Vision-Language-Action models
Dynamic token expansion near attention-highlighted regions
Action-aware token merging to reduce redundancy
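The expand-merge idea above can be sketched in a few lines of numpy. This is a toy illustration, not the authors' implementation: the function names, the 4-connected grid neighbourhood used for "spatial vicinity", and the greedy pairwise mean-merging that stands in for the paper's hierarchical-clustering fusion are all assumptions made here for clarity.

```python
import numpy as np

def expand_tokens(scores, grid_hw, top_k=4):
    """Keep the top-k attention-hotspot tokens plus their 4-connected
    grid neighbours (a stand-in for spatially aware expansion)."""
    h, w = grid_hw
    hot = np.argsort(scores)[-top_k:]          # indices of top-k tokens
    keep = set(int(i) for i in hot)
    for idx in hot:
        r, c = divmod(int(idx), w)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                keep.add(nr * w + nc)
    return np.array(sorted(keep))

def merge_tokens(tokens, target_n):
    """Greedily average the most cosine-similar token pair until only
    target_n tokens remain (a parameter-free fusion stand-in)."""
    toks = [t for t in tokens]
    while len(toks) > target_n:
        X = np.stack(toks)
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        sim = Xn @ Xn.T
        np.fill_diagonal(sim, -np.inf)         # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        merged = (toks[i] + toks[j]) / 2.0     # mean fusion
        toks = [t for k, t in enumerate(toks) if k not in (i, j)]
        toks.append(merged)
    return np.stack(toks)

# toy example: a 4x4 token grid with 8-dim features
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
scores = rng.random(16)                        # mock attention scores
keep = expand_tokens(scores, (4, 4), top_k=2)
compressed = merge_tokens(tokens[keep], target_n=4)
print(compressed.shape)                        # (4, 8)
```

In the actual method the attention scores come from the VLA backbone and the merging is action-aware and happens in deeper layers; this sketch only shows the token-count bookkeeping of expand-then-merge in one pass.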
Yifan Ye
College of Science and Engineering, Hamad Bin Khalifa University, Education City, Doha 24404, Qatar
Jiaqi Ma
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi 23201, UAE
Jun Cen
College of Computer Science and Technology, Zhejiang University, Hangzhou 310058, China
Zhihe Lu
HBKU ← NUS ← University of Surrey ← CASIA
Computer Vision · Transfer Learning · Few-shot Learning · Multimodal Learning · Continual Learning