Local Information Matters: Inference Acceleration For Grounded Conversation Generation Models Through Adaptive Local-Aware Token Pruning

📅 2025-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the inefficiency and loss of fine-grained visual detail in grounded conversation generation (GCG) caused by excessive visual tokens, this paper proposes the Adaptive Local-aware Token Pruning (ALTP) framework. ALTP introduces two key innovations: (1) a Detail Density Capture (DDC) mechanism that models the pixel-level distribution of visual detail via superpixel segmentation; and (2) a Dynamic Density Formation (DDF) strategy that adaptively allocates token budgets based on semantic importance. Integrated into multimodal architectures such as GLaMM and OMG-LLaVA, ALTP achieves 90% visual token compression on the GranDf benchmark while improving grounding accuracy—boosting GLaMM’s AP₅₀ by +4.9% and Recall by +5.0%, and OMG-LLaVA’s AP by +2.1% and mIoU by +3.0%. The framework thus achieves a strong balance between inference efficiency and pixel-level alignment fidelity.

📝 Abstract
Grounded Conversation Generation (GCG) is an emerging vision-language task that requires models to generate natural language responses seamlessly intertwined with corresponding object segmentation masks. Recent models, such as GLaMM and OMG-LLaVA, achieve pixel-level grounding but incur significant computational costs due to processing a large number of visual tokens. Existing token pruning methods, like FastV and PyramidDrop, fail to preserve the local visual features critical for accurate grounding, leading to substantial performance drops in GCG tasks. To address this, we propose Adaptive Local-Aware Token Pruning (ALTP), a simple yet effective framework that accelerates GCG models by prioritizing local object information. ALTP introduces two key components: (1) Detail Density Capture (DDC), which uses superpixel segmentation to retain tokens in object-centric regions, preserving fine-grained details, and (2) Dynamic Density Formation (DDF), which dynamically allocates tokens based on information density, ensuring higher retention in semantically rich areas. Extensive experiments on the GranDf dataset demonstrate that ALTP significantly outperforms existing token pruning methods, such as FastV and PyramidDrop, on both GLaMM and OMG-LLaVA models. Notably, when applied to GLaMM, ALTP achieves a 90% reduction in visual tokens with a 4.9% improvement in AP₅₀ and a 5.0% improvement in Recall compared to PyramidDrop. Similarly, on OMG-LLaVA, ALTP improves AP by 2.1% and mIoU by 3.0% at a 90% token reduction compared with PyramidDrop.
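The core idea behind DDF — splitting a global token budget across regions in proportion to their information density, then keeping only the top-scoring tokens per region — can be sketched as below. This is a minimal illustrative sketch under our own assumptions (the function names, the proportional-allocation rule, and top-k retention are ours), not the paper's released implementation.

```python
import numpy as np

def allocate_token_budget(region_densities, total_budget):
    """Split a global token budget across regions in proportion to
    their information density (illustrative sketch only)."""
    d = np.asarray(region_densities, dtype=float)
    budget = np.floor(d / d.sum() * total_budget).astype(int)
    # Hand any leftover tokens (from flooring) to the densest regions first.
    leftover = int(total_budget - budget.sum())
    for idx in np.argsort(-d)[:leftover]:
        budget[idx] += 1
    return budget

def prune_tokens(token_scores, region_ids, region_budget):
    """Within each region, keep only its highest-scoring tokens."""
    keep = []
    for r, b in enumerate(region_budget):
        idx = np.where(region_ids == r)[0]
        top = idx[np.argsort(-token_scores[idx])[:b]]
        keep.extend(top.tolist())
    return sorted(keep)

# Toy example: 20 visual tokens over 3 regions, 90% pruning (keep 2 tokens).
rng = np.random.default_rng(0)
scores = rng.random(20)                      # stand-in for per-token importance
regions = np.repeat([0, 1, 2], [8, 6, 6])    # stand-in for superpixel assignment
budget = allocate_token_budget([0.5, 0.3, 0.2], total_budget=2)
kept = prune_tokens(scores, regions, budget)
```

With a budget of only two tokens, the flooring-plus-leftover rule concentrates both tokens in the densest region, which matches the intended behavior of retaining more tokens in semantically rich areas.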
Problem

Research questions and friction points this paper is trying to address.

Reduces computational costs in Grounded Conversation Generation models
Preserves local visual features for accurate object grounding
Improves performance with adaptive token pruning techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Local-Aware Token Pruning (ALTP) framework
Detail Density Capture (DDC) for object-centric regions
Dynamic Density Formation (DDF) for semantic-rich areas
Bizhe Bai
Fudan University, Shanghai Innovation Institute
Jianjian Cao
Fudan University
Multimodal Learning · Model Compression · MLLM
Yadan Luo
ARC DECRA and Senior Lecturer, University of Queensland
Generalization · 3D Vision · Autonomous Driving
Tao Che
Fudan University, Shanghai Innovation Institute