🤖 AI Summary
This work addresses the high computational cost incurred by multi-view historical frame inputs in vision-language-action models, a cost stemming from the quadratic complexity of self-attention. To mitigate it, the authors propose the Intra-LLM Sparse Aggregator (ILSA), a novel architecture inspired by how human drivers allocate attention. ILSA dynamically prunes redundant visual tokens through text-guided token scoring, temporal consistency constraints, and sparsification built into the large language model, achieving efficient compression while preserving scene-perception fidelity. Evaluated on NAVSIM v2, ILSA retains only 15% of visual tokens, reducing inference FLOPs by 61% and overall computation by 32%, while maintaining 94% of the original accuracy, matching state-of-the-art performance.
📝 Abstract
The integration of Vision-Language-Action (VLA) models into autonomous driving systems offers a unified framework for interpreting complex scenes and executing control commands. However, the need to incorporate historical multi-view frames for accurate temporal reasoning imposes a severe computational burden, driven primarily by the quadratic complexity of self-attention in Large Language Models (LLMs). To alleviate this bottleneck, we propose ETA-VLA, an Efficient Token Adaptation framework for VLA models. ETA-VLA processes the past $n$ frames of multi-view images and introduces a novel Intra-LLM Sparse Aggregator (ILSA). Drawing inspiration from human drivers' attention allocation, ILSA dynamically identifies and prunes redundant visual tokens guided by textual queries and temporal consistency. Specifically, we use a text-guided scoring mechanism alongside a diversity-preserving sparsification strategy to select a sparse subset of critical tokens, ensuring comprehensive awareness of the driving scene. Extensive experiments on the NAVSIM v2 benchmark demonstrate that ETA-VLA achieves driving performance comparable to state-of-the-art baselines while reducing computational FLOPs by approximately 32%. Notably, our method prunes 85% of visual tokens and reduces inference FLOPs by 61% while still retaining 94% of the original accuracy.
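The core pruning step described in the abstract can be illustrated with a minimal sketch: score each visual token by its relevance to a pooled text query, then keep only the top fraction (the paper reports retaining 15% of tokens). The function name, dot-product scoring, and array shapes below are illustrative assumptions, not the authors' actual implementation, which also incorporates temporal-consistency and diversity-preserving criteria not shown here.

```python
import numpy as np

def text_guided_prune(visual_tokens, text_query, keep_ratio=0.15):
    """Hypothetical sketch of text-guided token pruning.

    visual_tokens: (N, d) array of visual token embeddings
    text_query:    (d,) pooled text-instruction embedding
    keep_ratio:    fraction of tokens to retain (0.15 per the paper)
    Returns the kept tokens and their indices.
    """
    # Text-guided relevance: dot product of each token with the query.
    scores = visual_tokens @ text_query              # shape (N,)
    k = max(1, int(round(keep_ratio * len(visual_tokens))))
    # Indices of the k highest-scoring tokens, best first.
    keep_idx = np.argsort(scores)[::-1][:k]
    return visual_tokens[keep_idx], keep_idx

# Usage: 100 tokens of dimension 8; keeping 15% leaves 15 tokens.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((100, 8))
query = rng.standard_normal(8)
kept, idx = text_guided_prune(tokens, query)
```

In the actual model this selection happens inside the LLM layers (hence "Intra-LLM"), so later self-attention operates on the reduced token set, which is what yields the quadratic savings in compute.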