ETA-VLA: Efficient Token Adaptation via Temporal Fusion and Intra-LLM Sparsification for Vision-Language-Action Models

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost incurred by multi-view historical frame inputs in Vision-Language-Action (VLA) models, a cost stemming from the quadratic complexity of self-attention. To mitigate this, the authors propose the Intra-LLM Sparse Aggregator (ILSA), a novel architecture inspired by how human drivers allocate attention. ILSA dynamically prunes redundant visual tokens through text-guided token scoring, temporal consistency constraints, and sparsification built into the large language model itself, achieving efficient compression while preserving scene-perception fidelity. Evaluated on NAVSIM v2, ILSA retains only 15% of visual tokens, reducing inference FLOPs by 61% and overall computation by 32%, yet maintains 94% of the original accuracy, matching state-of-the-art performance.
📝 Abstract
The integration of Vision-Language-Action (VLA) models into autonomous driving systems offers a unified framework for interpreting complex scenes and executing control commands. However, the need to incorporate historical multi-view frames for accurate temporal reasoning imposes a severe computational burden, driven primarily by the quadratic complexity of self-attention in Large Language Models (LLMs). To alleviate this bottleneck, we propose ETA-VLA, an Efficient Token Adaptation framework for VLA models. ETA-VLA processes the past $n$ frames of multi-view images and introduces a novel Intra-LLM Sparse Aggregator (ILSA). Drawing inspiration from human drivers' attention allocation, ILSA dynamically identifies and prunes redundant visual tokens guided by textual queries and temporal consistency. Specifically, we use a text-guided scoring mechanism alongside a diversity-preserving sparsification strategy to select a sparse subset of critical tokens, ensuring comprehensive awareness of the driving scene. Extensive experiments on the NAVSIM v2 benchmark demonstrate that ETA-VLA achieves driving performance comparable to state-of-the-art baselines while reducing computational FLOPs by approximately 32%. Notably, our method prunes 85% of visual tokens and reduces inference FLOPs by 61%, while still retaining 94% of the original accuracy on NAVSIM v2.
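The abstract describes scoring visual tokens against a textual query and then keeping a small, diverse subset. The paper's exact formulation is not given here, so the following is only a minimal sketch of that idea: relevance is cosine similarity to a text-query embedding, and a greedy loop with a similarity penalty (a hypothetical `diversity_weight` parameter) stands in for the diversity-preserving sparsification.

```python
import numpy as np

def text_guided_sparsify(visual_tokens, text_query, keep_ratio=0.15,
                         diversity_weight=0.5):
    """Hedged sketch: keep `keep_ratio` of tokens, trading off text
    relevance against redundancy with already-selected tokens."""
    # Cosine similarity between each visual token and the text query.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    q = text_query / np.linalg.norm(text_query)
    relevance = v @ q                                # shape (N,)

    n_keep = max(1, int(keep_ratio * len(v)))
    selected = [int(np.argmax(relevance))]           # seed: most relevant token
    for _ in range(n_keep - 1):
        # Penalize tokens similar to those already kept (diversity term).
        sim_to_kept = np.max(v @ v[selected].T, axis=1)
        score = relevance - diversity_weight * sim_to_kept
        score[selected] = -np.inf                    # never re-select a token
        selected.append(int(np.argmax(score)))
    return np.array(sorted(selected))

# Example: 100 visual tokens of dimension 32, keeping 15% of them.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((100, 32))
query = rng.standard_normal(32)
kept = text_guided_sparsify(tokens, query)
print(len(kept))  # 15
```

Retaining 15% of tokens mirrors the 85% pruning rate reported in the abstract; the actual ILSA module additionally applies temporal-consistency constraints across frames, which this sketch omits.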
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
temporal reasoning
computational burden
self-attention complexity
autonomous driving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token Sparsification
Vision-Language-Action Models
Temporal Fusion
Efficient LLM Inference
Text-Guided Attention
Yiru Wang
University of Pittsburgh
Anqing Jiang
Bosch Corporate Research, Bosch (China) Investment Ltd., Shanghai, China
Shuo Wang
Bosch Corporate Research, Bosch (China) Investment Ltd., Shanghai, China
Yuwen Heng
Bosch Corporate Research, Bosch (China) Investment Ltd., Shanghai, China
Zichong Gu
School of Communication and Information Engineering, Shanghai University, Shanghai, China
Hao Sun
Bosch Corporate Research, Bosch (China) Investment Ltd., Shanghai, China