GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision encoders in multimodal large language models (MLLMs) emphasize global representations but lack fine-grained region awareness, hindered by scarce region-level annotations and the absence of a dedicated pretraining paradigm. To address this, the authors propose GranViT, a Vision Transformer tailored for fine-grained visual understanding. The contributions are threefold: (1) Gran-29M, a large-scale fine-grained image-text dataset with dense region-text alignments; (2) a region-level autoregressive training paradigm built on two complementary regressions, bounding box to caption during pretraining and caption to bounding box during adaptation; and (3) a self-distillation mechanism that imposes explicit localization constraints on the vision encoder. Experiments show that GranViT achieves state-of-the-art results on fine-grained recognition, multimodal visual question answering, and OCR understanding, and that it transfers well to different LLMs.

📝 Abstract
Vision encoders are indispensable to the impressive performance of Multi-modal Large Language Models (MLLMs) on vision-language tasks such as visual question answering and reasoning. However, existing vision encoders focus on global image representations and overlook fine-grained regional analysis. Their fine-grained perception is limited by the scarcity of fine-grained annotated data and the lack of a fine-grained pretraining paradigm. In this paper, we propose GranViT, a novel Vision Transformer that integrates fine-grained feature extraction with semantic alignment to Large Language Models (LLMs) via region-level autoregressive training. We first construct Gran-29M, a dataset comprising 2 million natural and OCR images paired with over 180 million high-quality region-level annotations, to enable large-scale fine-grained pretraining. We then develop a pretraining-adaptation framework, along with a self-distillation mechanism, to train the fine-grained GranViT on Gran-29M. We fully exploit the fine-grained annotations in Gran-29M in both stages: bounding-box-to-caption regression enhances the localized visual representation of the vision encoder during pretraining, and caption-to-bounding-box regression improves the LLM's use of vision features and its localization during adaptation. We further incorporate a self-distillation mechanism that imposes explicit localization constraints on the vision encoder to strengthen its regional reasoning capability. Extensive experiments show that GranViT surpasses existing vision encoders and transfers well to a variety of LLMs. Remarkably, it achieves state-of-the-art results on fine-grained recognition, multimodal VQA, and OCR understanding.
Problem

Research questions and friction points this paper is trying to address.

Addresses limited fine-grained perception in vision encoders
Overcomes scarcity of fine-grained annotated training data
Enhances regional reasoning for multimodal language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive region-level training for fine-grained perception
Bounding-box-to-caption regression for localized representation
Self-distillation mechanism with explicit localization constraints
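
The two regression directions named above can be illustrated with a minimal sketch of how one region annotation might be turned into training pairs. Everything in it is an assumption made for illustration, not the paper's actual data format: the `RegionAnnotation` schema, the `<box>` tag, the 1000-bin coordinate quantization, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass(frozen=True)
class RegionAnnotation:
    """One region-level annotation (hypothetical schema, not from the paper)."""
    box: Box
    caption: str


def quantize_box(box: Box, width: int, height: int, bins: int = 1000) -> Box:
    """Map pixel coordinates to integer bins in [0, bins - 1].

    Integer arithmetic avoids floating-point rounding surprises.
    The bin count of 1000 is an assumption.
    """
    def q(value: int, size: int) -> int:
        return min(bins - 1, max(0, value * bins // size))

    x0, y0, x1, y1 = box
    return (q(x0, width), q(y0, height), q(x1, width), q(y1, height))


def box_to_text(qbox: Box) -> str:
    """Serialize a quantized box as text (the <box> tag is hypothetical)."""
    return "<box>" + ",".join(str(v) for v in qbox) + "</box>"


def make_training_pairs(
    ann: RegionAnnotation, width: int, height: int
) -> Dict[str, Tuple[str, str]]:
    """Build (prompt, target) text pairs for both regression directions."""
    box_text = box_to_text(quantize_box(ann.box, width, height))
    return {
        # Pretraining direction: given a box, predict its caption.
        "box_to_caption": (f"Describe the region {box_text}.", ann.caption),
        # Adaptation direction: given a caption, predict the box.
        "caption_to_box": (f"Locate: {ann.caption}", box_text),
    }


if __name__ == "__main__":
    ann = RegionAnnotation(box=(100, 50, 300, 250), caption="a red stop sign")
    for name, (prompt, target) in make_training_pairs(ann, 400, 500).items():
        print(name, "|", prompt, "->", target)
```

For a 400×500 image and the box (100, 50, 300, 250), the serialized box is `<box>250,100,750,500</box>`. In an autoregressive setup the target string of each pair would be the sequence the model learns to generate token by token; how GranViT actually tokenizes boxes is not stated in this summary.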
Guanghao Zheng
Shanghai Jiao Tong University, Shanghai, China
Bowen Shi
Shanghai Jiao Tong University, Shanghai, China
Mingxing Xu
Huawei Inc., China
Ruoyu Sun
Huawei Inc., China
Peisen Zhao
Huawei Inc.
Zhibo Zhang
Huawei Inc., China
Wenrui Dai
Shanghai Jiao Tong University
Predictive Modeling, Image/Video Coding, Signal Processing
Junni Zou
Professor, Shanghai Jiao Tong University
Multimedia communications - network resource optimization
Hongkai Xiong
Distinguished Professor, Shanghai Jiao Tong University
Image and video coding, signal processing, multimedia communication, vision and learning
Xiaopeng Zhang
Huawei Inc., China
Qi Tian
Huawei Inc., China