Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of existing Vision Transformer (ViT)-based multi-view 3D object detection methods, whose inefficiency stems from fixed-ratio token selection and full-model fine-tuning. To overcome these limitations, the authors propose a dynamic layer-wise token selection mechanism coupled with an image token compensator, enabling adaptive token pruning within the ViT backbone. This approach is paired with a parameter-efficient fine-tuning strategy that trains only the newly introduced modules, which comprise just 1.6 million parameters. Evaluated on the NuScenes dataset across three multi-view 3D object detection approaches, the method reduces computational complexity (GFLOPs) by 48%–55% and inference latency by 9%–25% compared to the previous state-of-the-art method ToC3D, while improving mean Average Precision (mAP) by 1.0–2.8 percentage points and NuScenes Detection Score (NDS) by 0.4–1.2 percentage points.

📝 Abstract

Existing multi-view three-dimensional (3D) object detection approaches widely adopt large-scale pre-trained vision transformer (ViT)-based foundation models as backbones, which makes them computationally expensive. To address this problem, the current state-of-the-art (SOTA) method for efficient multi-view ViT-based 3D object detection, \texttt{ToC3D}, employs ego-motion-based relevant token selection. However, it has two key limitations: (1) its fixed, layer-individual token selection ratios limit computational efficiency during both training and inference, and (2) it requires full end-to-end retraining of the ViT backbone. In this work, we propose an image token compensator combined with a token selection mechanism for ViT backbones to accelerate multi-view 3D object detection. Unlike \texttt{ToC3D}, our approach enables dynamic layer-wise token selection within the ViT backbone. Furthermore, we introduce a parameter-efficient fine-tuning strategy that trains only the proposed modules, thereby reducing the number of fine-tuned parameters from more than $300$ million (M) to only $1.6$ M. Experiments on the large-scale NuScenes dataset across three multi-view 3D object detection approaches demonstrate that our proposed method decreases computational complexity (GFLOPs) by $48\%$ to $55\%$ and inference latency (on an \texttt{NVIDIA-GV100} GPU) by $9\%$ to $25\%$, while still improving mean average precision by $1.0\%$ to $2.8\%$ absolute and NuScenes detection score by $0.4\%$ to $1.2\%$ absolute compared to the previous SOTA \texttt{ToC3D}.
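
The contrast between fixed-ratio and dynamic token selection can be illustrated with a small sketch. The abstract does not specify how the proposed selection is computed, so everything here is an assumption made for illustration: the function names, the per-token relevance scores, and the use of a simple score threshold as the "dynamic" rule are hypothetical and are not the authors' method.

```python
from typing import List


def select_fixed_ratio(scores: List[float], keep_ratio: float) -> List[int]:
    """Keep the top `keep_ratio` fraction of tokens, whatever their scores are."""
    n_keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def select_dynamic(scores: List[float], threshold: float) -> List[int]:
    """Keep every token scoring at least `threshold` (hypothetical rule).

    The number of kept tokens now depends on the input instead of a preset ratio.
    """
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    if not kept:
        # Always keep at least the single most relevant token.
        kept = [max(range(len(scores)), key=lambda i: scores[i])]
    return kept


# Hypothetical relevance scores for six image tokens in two different scenes.
busy_scene = [0.9, 0.8, 0.7, 0.75, 0.2, 0.85]
empty_scene = [0.1, 0.9, 0.2, 0.15, 0.05, 0.1]

fixed_busy = select_fixed_ratio(busy_scene, keep_ratio=0.5)
fixed_empty = select_fixed_ratio(empty_scene, keep_ratio=0.5)
dyn_busy = select_dynamic(busy_scene, threshold=0.5)
dyn_empty = select_dynamic(empty_scene, threshold=0.5)

print("fixed:  ", fixed_busy, fixed_empty)
print("dynamic:", dyn_busy, dyn_empty)
```

In this toy setting the fixed ratio keeps three tokens in both scenes, discarding relevant tokens in the busy scene and retaining irrelevant ones in the empty scene, while the threshold rule adapts the token count to the content. This is one plausible reading of why a fixed per-layer ratio can limit efficiency.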

Problem

Research questions and friction points this paper is trying to address.

multi-view 3D object detection

vision transformer

computational efficiency

token selection

parameter-efficient fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic token selection

parameter-efficient fine-tuning

multi-view 3D object detection
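
To put the parameter-efficient fine-tuning contribution in proportion, the two parameter counts quoted in the abstract can be compared directly. This is a back-of-the-envelope sketch: the abstract only states "more than 300 M" for full fine-tuning, so 300 M is used as an assumed lower bound and the results are bounds, not exact figures.

```python
# Parameter counts quoted in the abstract.
full_finetune_params = 300_000_000  # assumed lower bound ("more than 300 M")
peft_params = 1_600_000             # parameters of the newly introduced modules

# Share of parameters still being trained (an upper bound, since the
# full fine-tuning count is a lower bound).
fraction = peft_params / full_finetune_params

# Reduction in fine-tuned parameters (a lower bound for the same reason).
reduction_factor = full_finetune_params / peft_params

print(f"fine-tuned share: at most {fraction:.2%}")
print(f"reduction factor: at least {reduction_factor:.1f}x")
```

Under this assumption, the fine-tuned parameters amount to roughly half a percent of what full retraining would update.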