TIPS: Text-Image Pretraining with Spatial Awareness

📅 2024-10-21
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current image-text pretraining models exhibit weak spatial awareness, which limits their direct applicability to dense prediction tasks such as semantic segmentation and depth estimation; as a result, self-supervised image-only pretraining remains the default for these applications. To address this, the paper proposes TIPS, a general-purpose image-text model that can be used off the shelf for both global and dense understanding. The method rests on two ideas: (1) dual-source text supervision, which combines synthetically generated captions with noisy web-crawled text to provide a richer signal for fine-grained cross-modal alignment; and (2) a training recipe that combines contrastive image-text learning with masked image modeling to encourage spatial coherence. The model is scaled with a transformer architecture, trained on a curated set of public images. Evaluated on 8 tasks involving 16 datasets, it reports strong out-of-the-box results, including a +4.2 average mIoU gain on dense prediction tasks and a +5.7% R@1 improvement on image retrieval.

📝 Abstract
While image-text representation learning has become very popular in recent years, existing models tend to lack spatial awareness and have limited direct applicability for dense understanding tasks. For this reason, self-supervised image-only pretraining is still the go-to method for many dense vision applications (e.g. depth estimation, semantic segmentation), despite the lack of explicit supervisory signals. In this paper, we close this gap between image-text and self-supervised learning, by proposing a novel general-purpose image-text model, which can be effectively used off the shelf for dense and global vision tasks. Our method, which we refer to as Text-Image Pretraining with Spatial awareness (TIPS), leverages two simple and effective insights. First, on textual supervision: we reveal that replacing noisy web image captions by synthetically generated textual descriptions boosts dense understanding performance significantly, due to a much richer signal for learning spatially aware representations. We propose an adapted training method that combines noisy and synthetic captions, resulting in improvements across both dense and global understanding tasks. Second, on the learning technique: we propose to combine contrastive image-text learning with self-supervised masked image modeling, to encourage spatial coherence, unlocking substantial enhancements for downstream applications. Building on these two ideas, we scale our model using the transformer architecture, trained on a curated set of public images. Our experiments are conducted on 8 tasks involving 16 datasets in total, demonstrating strong off-the-shelf performance on both dense and global understanding, for several image-only and image-text tasks. Code and models are released at https://github.com/google-deepmind/tips.
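To make the two-part objective described above concrete, here is a minimal PyTorch sketch of a combined contrastive + masked-image-modeling loss in the spirit of the abstract. Everything here is an illustrative assumption rather than the released TIPS implementation: `image_encoder`, `text_encoder`, and `decoder` are hypothetical modules, and pixel-patch reconstruction is a simplified stand-in for the paper's masked-image-modeling target (see the official code at https://github.com/google-deepmind/tips for the actual recipe).

```python
import torch
import torch.nn.functional as F

def patchify(images, patch=16):
    """(B, C, H, W) -> (B, N, patch*patch*C) non-overlapping pixel patches."""
    B, C, H, W = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

def joint_loss(image_encoder, text_encoder, decoder,
               images, masked_images, mask, caption_tokens,
               temperature=0.07, mim_weight=1.0):
    # --- Contrastive image-text term (CLIP-style, symmetric) ---
    img_emb = F.normalize(image_encoder(images).mean(dim=1), dim=-1)  # (B, D)
    txt_emb = F.normalize(text_encoder(caption_tokens), dim=-1)       # (B, D)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(images.size(0), device=images.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets)
                         + F.cross_entropy(logits.t(), targets))

    # --- Masked image modeling term: reconstruct the masked patches,
    # encouraging spatially coherent patch-level representations ---
    tokens = image_encoder(masked_images)      # (B, N, D) patch tokens
    pred = decoder(tokens)                     # (B, N, patch*patch*C)
    target = patchify(images)                  # (B, N, patch*patch*C)
    mim = ((pred - target) ** 2)[mask].mean()  # mask: (B, N) bool, True = masked

    return contrastive + mim_weight * mim
```

The key design point the sketch captures is that a single image encoder feeds both terms, so the contrastive signal shapes global semantics while the reconstruction term constrains the per-patch tokens.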
Problem

Research questions and friction points this paper is trying to address.

Image-text models lack spatial awareness, limiting their off-the-shelf use for dense vision tasks such as depth estimation and semantic segmentation.
Noisy web captions provide a weak supervisory signal for learning spatially aware representations.
Contrastive image-text learning alone does not enforce the spatial coherence that dense downstream applications require.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines noisy web captions with synthetically generated captions during training (see the sketch after this list)
Integrates contrastive image-text learning with self-supervised masked image modeling
Scales the model with a transformer architecture trained on a curated set of public images
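As a concrete illustration of the first bullet, one hypothetical way to fold both caption sources into a CLIP-style objective is to give each image two text views, a noisy web caption and a synthetic caption, and average the symmetric contrastive loss over them. This is a minimal sketch under that assumption; the paper's actual sampling and weighting scheme may differ.

```python
import torch
import torch.nn.functional as F

def dual_caption_contrastive(img_emb, web_emb, syn_emb, temperature=0.07):
    """All inputs are (B, D) L2-normalized embeddings. Row i of web_emb and
    syn_emb are the two caption views paired with image i."""
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss = 0.0
    for txt_emb in (web_emb, syn_emb):  # one contrastive term per text view
        logits = img_emb @ txt_emb.t() / temperature
        loss = loss + 0.5 * (F.cross_entropy(logits, targets)
                             + F.cross_entropy(logits.t(), targets))
    return loss / 2  # average over the two caption sources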
👥 Authors
Kevis-Kokitsi Maninis (Google DeepMind)
Kaifeng Chen (Google DeepMind)
Soham Ghosh (Mistral AI)
Arjun Karpur (Google DeepMind)
Koert Chen (Google DeepMind)
Ye Xia (Google DeepMind)
Bingyi Cao (Google DeepMind)
Daniel M. Salz (Google DeepMind)
Guangxing Han (Google DeepMind)
Jan Dlabal (Google DeepMind)
Danushen Gnanapragasam (Google DeepMind)
Mojtaba Seyedhosseini (Google)
Howard Zhou (Google DeepMind)
Andre Araujo (Google DeepMind)