ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution

๐Ÿ“… 2025-10-14
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing multimodal large language models (MLLMs) suffer from high inference overhead due to fixed, dense visual tokenization. This work proposes Visual Consistency Learning (ViCO), a training framework for **semantic-complexity-aware dynamic visual token compression**. ViCO pairs multiple MLP connectors, each with a different image compression ratio, with a learnable patch-level Visual Resolution Router (ViR), enabling fine-grained, adaptive control of the visual token count. To preserve semantic fidelity across compression ratios, ViCO enforces consistency among the output distributions conditioned on the different connectors via KL-divergence minimization. The method is trained end-to-end and natively supports dynamic high-resolution inputs. Experiments show that ViCO reduces visual tokens by up to 50% while maintaining competitive performance on perception, reasoning, and OCR tasks, yielding substantial savings in inference compute and GPU memory.

๐Ÿ“ Abstract
Existing Multimodal Large Language Models (MLLMs) suffer from increased inference costs due to the additional vision tokens introduced by image inputs. In this work, we propose Visual Consistency Learning (ViCO), a novel training algorithm that enables the model to represent images of varying semantic complexities using different numbers of vision tokens. The key idea behind our method is to employ multiple MLP connectors, each with a different image compression ratio, to downsample the vision tokens based on the semantic complexity of the image. During training, we minimize the KL divergence between the responses conditioned on different MLP connectors. At inference time, we introduce an image router, termed Visual Resolution Router (ViR), that automatically selects the appropriate compression rate for each image patch. Compared with existing dynamic high-resolution strategies, which adjust the number of visual tokens based on image resolutions, our method dynamically adapts the number of visual tokens according to semantic complexity. Experimental results demonstrate that our method can reduce the number of vision tokens by up to 50% while maintaining the model's perception, reasoning, and OCR capabilities. We hope this work will contribute to the development of more efficient MLLMs. The code and models will be released to facilitate future research.
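The core training signal described above is a KL divergence between the model's output distributions when conditioned on different MLP connectors. The sketch below illustrates that objective in plain Python on toy discrete distributions; the function names (`kl_divergence`, `vico_consistency_loss`) and the per-position averaging are illustrative assumptions, not the paper's actual implementation.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def vico_consistency_loss(probs_reference, probs_compressed):
    """Average KL between the reference branch (e.g. the low-compression
    connector) and a compressed branch, over all output positions.
    Minimizing this pushes the compressed branch to answer consistently."""
    kls = [kl_divergence(p, q) for p, q in zip(probs_reference, probs_compressed)]
    return sum(kls) / len(kls)

# Toy example: next-token distributions at two output positions,
# vocabulary of size 3, from two hypothetical compression branches.
reference  = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
compressed = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]]
loss = vico_consistency_loss(reference, compressed)
```

In an actual MLLM these would be full logit tensors and the loss would be backpropagated through the compressed connector; the toy version only shows the shape of the objective.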
Problem

Research questions and friction points this paper is trying to address.

Reducing vision token count in MLLMs for efficiency
Adapting token compression based on image semantic complexity
Maintaining model capabilities while cutting computational costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic token compression based on semantic complexity
Multiple MLP connectors with varying compression ratios
Visual Resolution Router for automatic token selection
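The Visual Resolution Router can be thought of as a per-patch decision: patches judged semantically complex keep their full token budget, while simpler patches are routed through a higher-compression connector. The sketch below is a minimal illustration of that routing logic, assuming a scalar complexity score per patch, a single threshold, and a 4x compression ratio; all of these specifics are assumptions for illustration, not the paper's trained router.

```python
def route_patches(patch_scores, threshold=0.5):
    """Hypothetical patch-level routing: a patch whose semantic-complexity
    score exceeds the threshold keeps all of its tokens (ratio 1);
    otherwise it is sent through a 4x-compression connector (ratio 4)."""
    return [1 if score > threshold else 4 for score in patch_scores]

def total_token_count(patch_scores, tokens_per_patch=256, threshold=0.5):
    """Visual tokens actually fed to the LLM after routing."""
    ratios = route_patches(patch_scores, threshold)
    return sum(tokens_per_patch // r for r in ratios)

# Toy example: one text-dense patch (high score) and one sky patch (low score).
tokens = total_token_count([0.9, 0.1], tokens_per_patch=256)
```

This is where the efficiency gain comes from: unlike resolution-based dynamic tiling, the token budget here tracks content complexity, so uniform regions are compressed regardless of image resolution.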
๐Ÿ”Ž Similar Papers
No similar papers found.