🤖 AI Summary
Existing multimodal large language models suffer from limited fine-grained spatial perception due to low-resolution pretraining and noisy, coarse-grained image-text pairs. This work proposes FineViT, a vision encoder built on a novel progressive training paradigm grounded in dense re-annotation. FineViT first establishes a robust semantic foundation by training from scratch at high native resolution on billions of globally re-annotated image-text pairs, then aligns the vision encoder with a large language model using 450 million high-quality localized captions (FineCap-450M). By combining high-resolution from-scratch training with multi-stage learning, it systematically mitigates visual detail loss. The model achieves state-of-the-art performance on zero-shot recognition and retrieval tasks, significantly outperforming prominent visual encoders such as SigLIP2 and Qwen-ViT, particularly in long-context retrieval, thereby establishing a new baseline for fine-grained visual perception.
📝 Abstract
While Multimodal Large Language Models (MLLMs) have experienced rapid advancements, their visual encoders frequently remain a performance bottleneck. Conventional CLIP-based encoders struggle with dense spatial tasks due to the loss of visual details caused by low-resolution pretraining and the reliance on noisy, coarse web-crawled image-text pairs. To overcome these limitations, we introduce FineViT, a novel vision encoder specifically designed to unlock fine-grained perception. By replacing coarse web data with dense recaptions, we systematically mitigate information loss through a progressive training paradigm: first, the encoder is trained from scratch at a high native resolution on billions of globally recaptioned image-text pairs, establishing a robust, detail-rich semantic foundation. Subsequently, we further enhance its local perception through LLM alignment, utilizing our curated FineCap-450M dataset, which comprises over $450$ million high-quality local captions. Extensive experiments validate the effectiveness of the progressive strategy. FineViT achieves state-of-the-art zero-shot recognition and retrieval performance, especially in long-context retrieval, and consistently outperforms multimodal visual encoders such as SigLIP2 and Qwen-ViT when integrated into MLLMs. We hope FineViT can serve as a powerful new baseline for fine-grained visual perception.