Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited visual perception capability of multimodal large language models (MLLMs), this paper proposes a novel framework that dynamically adjusts visual token resolution within a single forward pass—inspired by human “saccadic” visual scanning. The method comprises two key components: (1) a layer-wise, attention-guided saliency scanning strategy that adaptively focuses computation on semantically critical regions; and (2) a plug-and-play Token Super-Resolution (TokenSR) module enabling dynamic expansion or pruning of token-level computational resources. Crucially, the approach requires no additional training or architectural modification, enhancing visual representation quality efficiently during inference. Evaluated on multiple vision-language understanding benchmarks—including MMBench, OCRBench, and TextVQA—the method consistently outperforms strong baselines, demonstrating that dynamic resolution control meaningfully improves both visual perception fidelity and downstream multimodal reasoning performance.


📝 Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential "blink-like" process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.
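The per-layer mechanism described in the abstract can be sketched as follows. This is a toy numpy illustration under stated assumptions, not the authors' implementation: the function names are placeholders, and the replication-based `token_super_resolution` merely stands in for the paper's plug-and-play TokenSR module, which would produce finer-grained sub-tokens.

```python
import numpy as np

def saliency_from_attention(attn, visual_idx):
    """Per-layer saliency estimate: mean attention each visual token
    receives across all query positions (head-averaged weights)."""
    return attn[:, visual_idx].mean(axis=0)

def select_salient(saliency, keep_ratio=0.25):
    """Indices of the top keep_ratio fraction of tokens by saliency."""
    k = max(1, int(len(saliency) * keep_ratio))
    return np.argsort(saliency)[::-1][:k]

def token_super_resolution(tokens, idx, factor=2):
    """Placeholder for TokenSR: each selected token is simply
    replicated `factor` times to stand in for the expanded
    token-level computation the real module would allocate."""
    return np.repeat(tokens[idx], factor, axis=0)
```

With a 2-query, 3-key attention map, `saliency_from_attention` averages the columns, `select_salient` keeps the most-attended visual token, and `token_super_resolution` expands it into sub-tokens that later layers can drop again if it loses focus.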
Problem

Research questions and friction points this paper is trying to address.

Enhances visual perception in multimodal language models
Dynamically allocates computation to salient visual tokens
Improves efficiency and adaptability in multimodal understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic visual token resolution framework
Saliency-guided scanning and token super-resolution
Adaptive token extension and dropping mechanism
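The adaptive extension-and-dropping mechanism above can be sketched as a layer-wise loop. This is a simplified sketch, not the paper's code: `toy_attention` is an invented stand-in for the model's real attention maps, and expansion by replication stands in for TokenSR.

```python
import numpy as np

def toy_attention(tokens, layer):
    """Invented stand-in for the model's attention in one layer:
    4 fake query positions, softmax over token norms."""
    logits = np.linalg.norm(tokens, axis=1)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return np.tile(w, (4, 1))

def blink_sketch(tokens, num_layers=3, keep_ratio=0.5):
    """Per-layer loop: estimate saliency, drop previously expanded
    tokens that lost focus, then expand currently salient tokens."""
    is_expanded = np.zeros(len(tokens), dtype=bool)
    for layer in range(num_layers):
        attn = toy_attention(tokens, layer)
        sal = attn.mean(axis=0)
        k = max(1, int(len(sal) * keep_ratio))
        salient = np.zeros(len(sal), dtype=bool)
        salient[np.argsort(sal)[::-1][:k]] = True
        # Drop extended tokens that are no longer salient.
        keep = ~(is_expanded & ~salient)
        tokens, is_expanded, salient = tokens[keep], is_expanded[keep], salient[keep]
        # Extend salient, not-yet-expanded tokens (replication as a
        # placeholder for TokenSR's sub-token expansion).
        new = tokens[salient & ~is_expanded]
        tokens = np.concatenate([tokens, new], axis=0)
        is_expanded = np.concatenate([is_expanded, np.ones(len(new), dtype=bool)])
    return tokens
```

The loop keeps the token budget adaptive: expanded tokens persist only while they stay salient, which is the balance between broad exploration and fine-grained focus the abstract describes.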
👥 Authors
Yuchen Feng (Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China)
Zhenyu Zhang (Baidu Inc., Beijing, China)
Naibin Gu (Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China)
Yilong Chen (Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China)
Peng Fu (Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China)
Zheng Lin (Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China)
Shuohuan Wang (Baidu)
Yu Sun (Baidu Inc., Beijing, China)
Hua Wu (Baidu Inc., Beijing, China)
Weiping Wang (School of Information Science and Engineering, Central South University)
Haifeng Wang (Baidu Inc., Beijing, China)