TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited fine-grained alignment between image patches and textual concepts in current vision-language foundation models. To enhance patch-text alignment, the authors propose a pretraining framework with four components: patch-level knowledge distillation, iBOT++ (a variant of the iBOT masked image modeling objective in which unmasked patches also contribute directly to the loss), a modified exponential moving average setup, and a caption sampling strategy that draws on synthetic captions at different granularities. Built upon an efficient dual-encoder architecture, the resulting TIPSv2 models perform generally on par with or better than recent vision encoders across 9 tasks and 20 datasets.
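To make "patch-text alignment" concrete, the following minimal sketch scores each image patch against a set of text concept embeddings by cosine similarity and assigns each patch its best-matching concept. It is an illustration under stated assumptions, not the TIPSv2 code: the function names, the toy 2-dimensional embeddings, and the argmax labeling rule are all hypothetical, and a real model would produce the embeddings with its image and text encoders.

```python
import numpy as np


def patch_text_similarity(patch_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between every patch and every text embedding.

    patch_emb: (num_patches, dim) patch embeddings from an image encoder.
    text_emb:  (num_concepts, dim) text embeddings of candidate concepts.
    Returns an array of shape (num_patches, num_concepts).
    """
    p = patch_emb / np.linalg.norm(patch_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return p @ t.T


def label_patches(patch_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Assign each patch the index of its most similar concept (hypothetical rule)."""
    return patch_text_similarity(patch_emb, text_emb).argmax(axis=-1)


# Toy data (assumption): two concepts and three patches in a 2-D embedding space.
text_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
patch_emb = np.array([[2.0, 0.1], [0.2, 3.0], [1.0, 0.9]])

similarity = patch_text_similarity(patch_emb, text_emb)
labels = label_patches(patch_emb, text_emb)
print(labels)
```

Well-aligned patch and text embeddings make this kind of per-patch labeling meaningful, which is why the capability matters for dense tasks such as segmentation.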

📝 Abstract

Recent progress in vision-language pretraining has enabled significant improvements to many downstream computer vision applications, such as classification, retrieval, segmentation and depth prediction. However, a fundamental capability that these models still struggle with is aligning dense patch representations with text embeddings of corresponding concepts. In this work, we investigate this critical issue and propose novel techniques to enhance this capability in foundational vision-language models. First, we reveal that a patch-level distillation procedure significantly boosts dense patch-text alignment -- surprisingly, the patch-text alignment of the distilled student model strongly surpasses that of the teacher model. This observation inspires us to consider modifications to pretraining recipes, leading us to propose iBOT++, an upgrade to the commonly-used iBOT masked image objective, where unmasked tokens also contribute directly to the loss. This dramatically enhances patch-text alignment of pretrained models. Additionally, to improve vision-language pretraining efficiency and effectiveness, we modify the exponential moving average setup in the learning recipe, and introduce a caption sampling strategy to benefit from synthetic captions at different granularities. Combining these components, we develop TIPSv2, a new family of image-text encoder models suitable for a wide range of downstream applications. Through comprehensive experiments on 9 tasks and 20 datasets, we demonstrate strong performance, generally on par with or better than recent vision encoder models. Code and models are released via our project page at https://gdm-tipsv2.github.io/ .
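The difference between an iBOT-style masked objective and the iBOT++ idea described in the abstract can be sketched as a change in which patches enter the loss. The example below is a simplified illustration, not the authors' implementation: the function name, the temperature values, the omission of teacher centering, and the uniform averaging over all patches are assumptions made for the sketch, since the abstract only states that unmasked tokens also contribute directly to the loss.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def patch_distillation_loss(student_logits, teacher_logits, mask,
                            include_unmasked,
                            student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student patch distributions.

    student_logits, teacher_logits: (num_patches, num_prototypes).
    mask: boolean (num_patches,), True where the student input was masked.
    include_unmasked: False averages over masked patches only (iBOT-style);
        True averages over all patches (the iBOT++ idea, with uniform
        weighting assumed here).
    Temperatures are illustrative assumptions, not values from the paper.
    """
    teacher_probs = np.exp(log_softmax(teacher_logits / teacher_temp))
    student_logp = log_softmax(student_logits / student_temp)
    per_patch = -(teacher_probs * student_logp).sum(axis=-1)
    if include_unmasked:
        weights = np.ones_like(per_patch)
    else:
        weights = mask.astype(per_patch.dtype)
    return (per_patch * weights).sum() / np.maximum(weights.sum(), 1.0)


# Toy example (assumption): 4 patches, 3 prototypes, first two patches masked.
teacher_logits = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0],
                           [0.0, 0.0, 1.0],
                           [1.0, 0.0, 0.0]])
# The student agrees with the teacher on masked patches but not on unmasked ones.
student_logits = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0],
                           [1.0, 0.0, 0.0],
                           [0.0, 0.0, 1.0]])
mask = np.array([True, True, False, False])

loss_masked = patch_distillation_loss(student_logits, teacher_logits, mask,
                                      include_unmasked=False)
loss_all = patch_distillation_loss(student_logits, teacher_logits, mask,
                                   include_unmasked=True)
print(loss_masked, loss_all)
```

In this toy case the masked-only loss is near zero even though the student disagrees with the teacher on the unmasked patches, while the all-patch loss penalizes that disagreement. This shows how including unmasked tokens places direct supervision on every patch representation.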

Problem

Research questions and friction points this paper is trying to address.

patch-text alignment

vision-language pretraining

dense representation

image-text alignment

foundation models

Innovation

Methods, ideas, or system contributions that make the work stand out.

patch-text alignment

vision-language pretraining

iBOT++