Rethinking Token Reduction for Large Vision-Language Models

📅 2026-03-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the high computational cost of large vision-language models (LVLMs) in multi-turn visual question answering (MT-VQA), a cost driven primarily by redundant visual tokens and exacerbated by the need to preserve information relevant to future, unknown questions. To overcome the limitations of existing heuristic compression methods, the authors propose MetaCompress, a framework that unifies token pruning and merging into a single learnable, prompt-agnostic compression mapping. They further introduce an efficient training paradigm tailored to optimizing visual representations for multi-turn dialogue. MetaCompress surpasses current approaches across multiple LVLM architectures and MT-VQA benchmarks, achieving a superior efficiency-accuracy trade-off while generalizing well across conversation turns.

๐Ÿ“ Abstract
Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code is available at https://github.com/MArSha1147/MetaCompress.
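The abstract's central idea, formulating token reduction as a single compression mapping that subsumes both pruning and merging, can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; it only shows how a row-stochastic assignment matrix generalizes the two reduction styles (near one-hot rows select tokens, as in pruning; spread-out rows average tokens, as in merging). All names, shapes, and the softmax parameterization are illustrative assumptions.

```python
import numpy as np

def compress_tokens(tokens, assignment_logits):
    """Map N visual tokens of dimension D down to K compressed tokens.

    tokens:            (N, D) array of visual token embeddings.
    assignment_logits: (K, N) array; in a learning-based method these
                       would be trained, here they are just an input.
    """
    # Softmax over the token axis makes each row a convex combination:
    # a near one-hot row "prunes" (keeps one token), a diffuse row "merges".
    shifted = assignment_logits - assignment_logits.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=1, keepdims=True)   # rows sum to 1
    return weights @ tokens                          # (K, D)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 1024))   # e.g. a 24x24 ViT patch grid
logits = rng.normal(size=(64, 576))     # compress 576 tokens down to 64
compressed = compress_tokens(tokens, logits)
print(compressed.shape)                  # (64, 1024)
```

The appeal of this unified view is that the choice between pruning and merging no longer has to be made by a heuristic (e.g. attention scores); the mapping itself can be optimized end to end, which is what a learning-based, prompt-agnostic method like the one described would train.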
Problem

Research questions and friction points this paper is trying to address.

token reduction
multi-turn VQA
vision-language models
visual tokens
inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

token reduction
multi-turn VQA
learning-based compression
prompt-agnostic
vision-language models
Yi Wang
Ph.D. Student of CS, ZJU
Generative Model, Multimodal Foundation Model

Haofei Zhang
State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security

Qihan Huang
PhD Student, Zhejiang University

Anda Cao
College of Computer Science and Technology, Zhejiang University

Gongfan Fang
National University of Singapore
Efficient Deep Learning, Generative Models, Large Language Models

Wei Wang
Tongyi Lab, Alibaba Group
Generative Models

Xuan Jin
Alibaba Group

Jie Song
Professor, University of Massachusetts Chan Medical School
Biomaterials, Regenerative Medicine

Mingli Song
College of Computer Science and Technology, Zhejiang University; State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security; School of Software Technology, Zhejiang University

Xinchao Wang
National University of Singapore
Machine Learning, AI, Computer Vision, Image Processing, Natural Language Processing