Geometry-Guided 3D Visual Token Pruning for Video-Language Models

📅 2026-04-20

🤖 AI Summary
Existing 3D vision-language models are slow at inference because they process highly redundant spatial video tokens, and prevailing pruning methods neglect both view consistency and spatial diversity. This work proposes Geo3DPruner, a geometry-guided pruning framework that models inter-frame correlations through geometry-aware global attention. It then prunes in two stages: an intra-voxel stage selects representative multi-view features within each voxel to preserve view consistency, and an inter-voxel stage keeps a spatially diverse, globally distributed subset of voxels to preserve scene completeness. Across multiple 3D scene understanding benchmarks, the method retains over 90% of the original model's performance while pruning 90% of visual tokens, significantly outperforming existing text-guided and vision-guided pruning approaches.

📝 Abstract
Multimodal large language models have demonstrated remarkable capabilities in 2D vision, motivating their extension to 3D scene understanding. Recent studies represent 3D scenes as 3D spatial videos composed of image sequences with depth and camera pose information, enabling pre-trained video-language models to perform 3D reasoning tasks. However, the large number of visual tokens in spatial videos remains a major bottleneck for efficient inference and context management. Existing pruning methods overlook the view consistency of spatial videos and the spatial diversity of the remaining tokens, which prevents them from effectively removing inter-frame redundancy and preserving scene completeness. In this paper, we propose Geo3DPruner, a Geometry-Guided 3D Visual Token Pruning framework. Geo3DPruner first models cross-frame relevance through geometry-aware global attention, and then performs a two-stage pruning process. The intra-voxel stage selects representative multi-view features within each voxel, while the inter-voxel stage preserves spatial diversity by selecting a globally distributed subset of voxels. Extensive experiments on multiple 3D scene understanding benchmarks demonstrate that Geo3DPruner retains over 90% of the original performance while pruning 90% of visual tokens, significantly outperforming existing text-guided and vision-guided pruning methods.
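
The two-stage process described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the selection criteria, so the sketch assumes each visual token already has a back-projected 3D position (from depth and camera pose) and a scalar relevance score standing in for the geometry-aware global attention. The intra-voxel stage is approximated by keeping the highest-scoring token per voxel, and the inter-voxel stage by farthest point sampling, a generic technique swapped in here for "a globally distributed subset of voxels". All function names, `voxel_size`, and `keep_ratio` are hypothetical.

```python
import numpy as np

def voxel_ids(points, voxel_size):
    """Map 3D token positions (N, 3) to integer voxel ids (N,)."""
    coords = np.floor(points / voxel_size).astype(np.int64)
    _, ids = np.unique(coords, axis=0, return_inverse=True)
    return ids.reshape(-1)  # flatten; inverse shape differs across NumPy versions

def intra_voxel_select(ids, scores):
    """Stage 1 (assumed rule): keep the highest-scoring token in each voxel."""
    # lexsort uses the last key as primary: sort by voxel id, then by descending score.
    order = np.lexsort((-scores, ids))
    sorted_ids = ids[order]
    is_first = np.r_[True, sorted_ids[1:] != sorted_ids[:-1]]
    return order[is_first]

def inter_voxel_select(points, budget):
    """Stage 2 (assumed rule): farthest point sampling for spatial diversity."""
    budget = min(budget, len(points))
    chosen = [0]  # arbitrary deterministic starting point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(budget - 1):
        nxt = int(np.argmax(dist))  # point farthest from everything chosen so far
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def prune_tokens(points, scores, voxel_size=0.2, keep_ratio=0.1):
    """Return sorted indices of the tokens kept after both stages."""
    ids = voxel_ids(points, voxel_size)
    reps = intra_voxel_select(ids, scores)           # one token per voxel
    budget = max(1, int(round(keep_ratio * len(points))))
    picked = inter_voxel_select(points[reps], budget)
    return np.sort(reps[picked])
```

With `keep_ratio=0.1`, the sketch keeps at most 10% of the input tokens, matching the 90% pruning rate quoted in the abstract; the accuracy figures reported there depend on the authors' actual scoring and selection rules, which this sketch does not reproduce.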
Problem

Research questions and friction points this paper is trying to address.

3D Visual Token Pruning
Spatial Video Redundancy
View Consistency
Spatial Diversity
Multimodal Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometry-Guided Pruning
3D Visual Token
Spatial Video
Voxel-Based Selection
Multimodal Large Language Models