FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models suffer from limited fine-grained spatial perception due to low-resolution pretraining and noisy, coarse-grained image-text pairs. This work proposes FineViT, introducing a novel progressive training paradigm grounded in dense re-annotation. It first establishes a robust semantic foundation using billions of globally re-annotated image-text pairs, then aligns the vision encoder with a large language model via 450 million high-quality localized captions (FineCap-450M). By integrating high-resolution from-scratch training with multi-stage learning, FineViT systematically mitigates visual detail loss. The model achieves state-of-the-art performance in zero-shot recognition and retrieval tasks, significantly outperforming prominent visual encoders such as SigLIP2 and Qwen-ViT—particularly in long-context retrieval—thereby establishing a new baseline for fine-grained visual perception.

📝 Abstract
While Multimodal Large Language Models (MLLMs) have experienced rapid advancements, their visual encoders frequently remain a performance bottleneck. Conventional CLIP-based encoders struggle with dense spatial tasks due to the loss of visual details caused by low-resolution pretraining and the reliance on noisy, coarse web-crawled image-text pairs. To overcome these limitations, we introduce FineViT, a novel vision encoder specifically designed to unlock fine-grained perception. By replacing coarse web data with dense recaptions, we systematically mitigate information loss through a progressive training paradigm: first, the encoder is trained from scratch at a high native resolution on billions of globally recaptioned image-text pairs, establishing a robust, detail-rich semantic foundation. Subsequently, we further enhance its local perception through LLM alignment, utilizing our curated FineCap-450M dataset, which comprises over 450 million high-quality local captions. Extensive experiments validate the effectiveness of the progressive strategy. FineViT achieves state-of-the-art zero-shot recognition and retrieval performance, especially in long-context retrieval, and consistently outperforms multimodal visual encoders such as SigLIP2 and Qwen-ViT when integrated into MLLMs. We hope FineViT can serve as a powerful new baseline for fine-grained visual perception.
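The first stage described above is CLIP-style contrastive pretraining on recaptioned image-text pairs. As a point of reference, the standard symmetric contrastive (InfoNCE) objective used by CLIP-family encoders can be sketched in NumPy as below; this is generic illustrative code, not FineViT's actual implementation, and the temperature value is an assumption.

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image/text embeddings.

    Matched pairs share a row index, so the targets are the diagonal of the
    (N, N) similarity matrix. Generic CLIP-style sketch, not FineViT code.
    """
    # L2-normalize so the dot product is cosine similarity
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img_emb @ txt_emb.T / temperature  # (N, N) similarity matrix
    labels = np.arange(len(logits))             # matched pairs on the diagonal

    def cross_entropy(l):
        # numerically stable log-softmax, then pick the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

When image and text embeddings are perfectly aligned the loss approaches zero; for mismatched batches it stays positive, which is what drives the encoder toward the detail-rich alignment the paper targets.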
Problem

Research questions and friction points this paper is trying to address.

fine-grained perception
visual encoder
dense spatial tasks
multimodal large language models
image-text pairs
Innovation

Methods, ideas, or system contributions that make the work stand out.

FineViT
dense recaptions
progressive training
fine-grained perception
vision encoder
Authors:
Peisen Zhao, Huawei Inc.
Xiaopeng Zhang, Huawei Inc.
Mingxing Xu, Huawei Inc.
Ruoyu Sun, Huawei Inc.
Zewei Du, Huawei Inc.
Dunzheng Wang, Huawei Inc.
Guanghao Zheng, Huawei Inc.
Haohang Xu, Shanghai Jiao Tong University
Zhibo Zhang, Huawei Inc.
Yuhang Zhang, Huawei Inc.
Yi Ai, Huawei Inc.
Lin Liu, Huawei Inc.
Qi Tian, Huawei Inc.