ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving

📅 2026-04-21

🤖 AI Summary
This work addresses the high computational cost that multi-view, multi-frame inputs impose on vision-language models in autonomous driving. It proposes ST-Prune, a plug-and-play, training-free spatio-temporal token pruning framework that models temporal and spatial redundancy in driving scenes jointly rather than treating each frame or view in isolation. Its first module, Motion-aware Temporal Pruning (MTP), adds soft constraints based on motion dynamics and temporal recency to a diversity-based selection objective in order to keep the critical temporal tokens. Its second module, Ring-view Spatial Pruning (RSP), uses the geometric structure of the multi-camera rig to suppress redundant cross-view information. Evaluated on four benchmarks spanning perception, prediction, and planning, ST-Prune sets a new state of the art for training-free pruning: at 90% token reduction it stays near-lossless, surpasses the full model on certain metrics, and runs at inference speeds comparable to existing pruning methods.
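
To make the MTP idea concrete, here is a minimal sketch of a greedy diversity selection with soft motion and recency bonuses. It is not the authors' implementation: the function name `motion_aware_select`, the weights `alpha` and `beta`, the additive form of the score, and the use of cosine similarity are all assumptions chosen for illustration, and plain Python lists stand in for real token embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def motion_aware_select(feats, frame_ids, motion, keep, alpha=0.5, beta=0.5):
    """Greedily pick `keep` token indices (hypothetical scoring, see lead-in).

    score(i) = (1 - max similarity to already selected tokens)   # diversity
               + alpha * motion[i]                                # soft motion bonus
               + beta  * recency[i]                               # soft recency bonus
    """
    n = len(feats)
    keep = min(keep, n)
    first, last = min(frame_ids), max(frame_ids)
    span = max(last - first, 1)
    # Recency is 1.0 for the newest frame and 0.0 for the oldest one.
    recency = [(f - first) / span for f in frame_ids]
    bonus = [alpha * motion[i] + beta * recency[i] for i in range(n)]

    max_sim = [0.0] * n          # highest similarity to any selected token so far
    selected = []
    remaining = set(range(n))
    while len(selected) < keep:
        # Iterating over sorted indices makes ties resolve to the lowest index.
        best = max(sorted(remaining), key=lambda i: (1.0 - max_sim[i]) + bonus[i])
        selected.append(best)
        remaining.remove(best)
        for i in remaining:
            max_sim[i] = max(max_sim[i], cosine(feats[i], feats[best]))
    return selected


# Toy input: three near-identical background tokens and one moving object.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
frame_ids = [0, 1, 0, 0]            # token 1 comes from the current frame
motion = [0.0, 0.0, 0.8, 0.0]       # token 2 lies on a dynamic trajectory
picked = motion_aware_select(feats, frame_ids, motion, keep=2)
print(picked)  # prints [1, 2]
```

In this toy case the current-frame background token and the moving-object token survive, while the two stale copies of the background are dropped because they add almost no diversity and earn no bonus.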

📝 Abstract
Vision-Language Models (VLMs) have become central to autonomous driving systems, yet their deployment is severely bottlenecked by the massive computational overhead of multi-view camera and multi-frame video inputs. Existing token pruning methods, primarily designed for single-image inputs, treat each frame or view in isolation and thus fail to exploit the inherent spatio-temporal redundancies in driving scenarios. To bridge this gap, we propose ST-Prune, a training-free, plug-and-play framework comprising two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). MTP addresses temporal redundancy by encoding motion volatility and temporal recency as soft constraints within the diversity selection objective, prioritizing dynamic trajectories and current-frame content over static historical background. RSP further resolves spatial redundancy by exploiting the ring-view camera geometry to penalize bilateral cross-view similarity, eliminating duplicate projections and residual background that temporal pruning alone cannot suppress. These two modules together constitute a complete spatio-temporal pruning process, preserving key scene information under strict compression. Validated across four benchmarks spanning perception, prediction, and planning, ST-Prune establishes a new state of the art for training-free token pruning. Notably, even at 90% token reduction, ST-Prune achieves near-lossless performance, with certain metrics surpassing the full-model baseline, while maintaining inference speeds comparable to existing pruning approaches.
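
The cross-view penalty behind RSP can be illustrated with a small sketch. It reads "bilateral" as "against both adjacent cameras in the ring", which is an interpretation of the abstract and not a confirmed detail of the paper. The function name `ring_view_prune`, the per-view budget, the averaging of the left and right similarities, and the choice to compare only against tokens already kept in the neighbouring views are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def ring_view_prune(views, keep_per_view):
    """Keep the `keep_per_view` least cross-view-redundant tokens of each camera.

    `views` is a list of per-camera token lists ordered around the ring, so
    camera v is adjacent to cameras (v - 1) % V and (v + 1) % V.
    Returns, for every view, the sorted indices of the tokens that are kept.
    """
    n_views = len(views)
    kept = [[] for _ in range(n_views)]
    for v, tokens in enumerate(views):
        # Left and right neighbours in the ring (a set handles V <= 2).
        neighbours = sorted({(v - 1) % n_views, (v + 1) % n_views} - {v})
        penalty = []
        for tok in tokens:
            # Highest similarity to a token already kept in each neighbour;
            # a neighbour with nothing kept yet contributes 0.0.
            sims = [
                max((cosine(tok, views[u][j]) for j in kept[u]), default=0.0)
                for u in neighbours
            ]
            penalty.append(sum(sims) / len(sims) if sims else 0.0)
        # Lowest penalty first; ties resolve to the lowest token index.
        order = sorted(range(len(tokens)), key=lambda i: (penalty[i], i))
        kept[v] = sorted(order[:keep_per_view])
    return kept


# Toy ring of three cameras whose fields of view share two "objects".
views = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
]
kept = ring_view_prune(views, keep_per_view=1)
print(kept)  # prints [[0], [1], [2]]
```

Processing the views in order and penalizing only similarity to tokens that were already kept breaks the symmetry between duplicate projections, so one copy of a shared object survives instead of both being discarded. A real system would also need a saliency or relevance term, which this sketch leaves out.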
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Token Pruning
Autonomous Driving
Spatio-Temporal Redundancy
Computational Overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

token pruning
spatio-temporal redundancy
vision-language models
autonomous driving
training-free
Lin Sha
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Haiyun Guo
Rice University ECE Ph.D.
optical imaging, computational photography, metalens
Tao Wang
Carizon, Beijing 100094, China
Cong Zhang
Carizon, Beijing 100094, China
Min Huang
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Jinqiao Wang
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Qinghai Miao
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China