🤖 AI Summary
This work addresses the high computational cost incurred by multi-view historical frame inputs in vision-language-action models, a cost stemming from the quadratic complexity of self-attention. To mitigate it, the authors propose the Intra-LLM Sparse Aggregator (ILSA), a novel architecture inspired by how human drivers allocate attention. ILSA dynamically prunes redundant visual tokens through text-guided token scoring, temporal consistency constraints, and sparsification built into the large language model, achieving efficient compression while preserving scene-perception fidelity. Evaluated on NAVSIM v2, ILSA retains only 15% of visual tokens, reducing inference FLOPs by 61% and overall computation by 32%, while maintaining 94% of the original accuracy, matching state-of-the-art performance.
📝 Abstract
The integration of Vision-Language-Action (VLA) models into autonomous driving systems offers a unified framework for interpreting complex scenes and executing control commands. However, the need to incorporate historical multi-view frames for accurate temporal reasoning imposes a severe computational burden, driven primarily by the quadratic complexity of self-attention in Large Language Models (LLMs). To alleviate this bottleneck, we propose ETA-VLA, an Efficient Token Adaptation framework for VLA models. ETA-VLA processes the past $n$ frames of multi-view images and introduces a novel Intra-LLM Sparse Aggregator (ILSA). Drawing inspiration from human drivers' attention allocation, ILSA dynamically identifies and prunes redundant visual tokens guided by textual queries and temporal consistency. Specifically, we use a text-guided scoring mechanism alongside a diversity-preserving sparsification strategy to select a sparse subset of critical tokens, ensuring comprehensive awareness of the driving scene. Extensive experiments on the NAVSIM v2 benchmark demonstrate that ETA-VLA achieves driving performance comparable to state-of-the-art baselines while reducing computational FLOPs by approximately 32%. Notably, our method prunes 85% of visual tokens and reduces inference FLOPs by 61% while still retaining 94% of the original accuracy.
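The core pruning step described in the abstract can be illustrated with a minimal sketch: score each visual token by its relevance to a pooled text query, then keep only the top fraction (the paper reports retaining 15% of tokens). The function name, dot-product scoring, and array shapes below are illustrative assumptions, not the authors' actual implementation, which also incorporates temporal-consistency and diversity-preserving criteria not shown here.

```python
import numpy as np

def text_guided_prune(visual_tokens, text_query, keep_ratio=0.15):
    """Hypothetical sketch of text-guided token pruning.

    visual_tokens: (N, d) array of visual token embeddings
    text_query:    (d,) pooled text-instruction embedding
    keep_ratio:    fraction of tokens to retain (0.15 per the paper)
    Returns the kept tokens and their indices.
    """
    # Text-guided relevance: dot product of each token with the query.
    scores = visual_tokens @ text_query              # shape (N,)
    k = max(1, int(round(keep_ratio * len(visual_tokens))))
    # Indices of the k highest-scoring tokens, best first.
    keep_idx = np.argsort(scores)[::-1][:k]
    return visual_tokens[keep_idx], keep_idx

# Usage: 100 tokens of dimension 8; keeping 15% leaves 15 tokens.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((100, 8))
query = rng.standard_normal(8)
kept, idx = text_guided_prune(tokens, query)
```

In the actual model this selection happens inside the LLM layers (hence "Intra-LLM"), so later self-attention operates on the reduced token set, which is what yields the quadratic savings in compute.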