🤖 AI Summary
This work addresses the high computational cost of processing large numbers of visual tokens in multimodal large language models (MLLMs) for end-to-end autonomous driving, where existing token reduction methods often degrade end-task performance. To this end, we propose SToRM, the first supervised token reduction framework tailored for MLLMs. It features a lightweight importance predictor based on short-term sliding windows, pseudo-supervision signals derived from an auxiliary all-token LLM pass, and an anchor-context merging mechanism that enables efficient, low-loss reduction. Evaluated on the LangAuto benchmark, SToRM significantly outperforms existing approaches under the same reduced-token budget, cutting computational cost by up to 30× while matching the performance achieved with all tokens.
📝 Abstract
In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have achieved significant advances. For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios. However, this approach requires substantial computational resources, which are limited in autonomous vehicles, due to its reliance on an LLM and the large number of visual tokens produced from sensor inputs. Many MLLM studies have explored reducing visual tokens, but these methods often degrade end-task performance compared to using all tokens. To enable efficient E2E driving while maintaining performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens, and merges each context token into its most relevant anchor to reduce redundancy while minimizing information loss. Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs under the same reduced-token budget, maintaining all-token performance while reducing computational cost by up to 30×.
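To make the anchor-context merging idea concrete, the following is a minimal NumPy sketch, not the paper's implementation: the function name, the cosine-similarity assignment, and the mean-based merge are illustrative assumptions. It keeps the top-scoring tokens as anchors and folds each remaining context token into its most similar anchor.

```python
import numpy as np

def anchor_context_merge(tokens, scores, num_anchors):
    """Illustrative sketch of anchor-context merging (not the paper's code).

    tokens: (N, D) visual token embeddings
    scores: (N,) importance scores, e.g. from a lightweight predictor
    num_anchors: number of tokens kept after reduction
    Returns a (num_anchors, D) array of merged tokens.
    """
    # Partition: highest-scoring tokens become anchors, the rest context.
    order = np.argsort(-scores)
    anchors = tokens[order[:num_anchors]]          # (K, D)
    context = tokens[order[num_anchors:]]          # (N-K, D)

    # Assign each context token to its most similar anchor (cosine similarity
    # is an assumed choice here).
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    assign = (c @ a.T).argmax(axis=1)              # (N-K,)

    # Merge: each anchor becomes the mean of itself and its assigned context
    # tokens, so context information is retained rather than dropped.
    merged = anchors.copy()
    for k in range(num_anchors):
        group = context[assign == k]
        if len(group):
            merged[k] = (anchors[k] + group.sum(axis=0)) / (1 + len(group))
    return merged
```

Under this sketch, reducing N tokens to K anchors shrinks the LLM's visual input by a factor of N/K while every discarded token still contributes to some surviving anchor, which is the low-loss property the abstract describes.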