🤖 AI Summary
This work addresses a challenge in visual self-supervised learning: the inherent uncertainty of masked regions makes it hard to learn semantically rich representations. To overcome this limitation, the authors propose the Text-Conditional Joint-Embedding Predictive Architecture (TC-JEPA), which introduces image captions as textual conditions into the JEPA framework for the first time. Using a fine-grained text conditioner built on sparse cross-attention, TC-JEPA makes the features of masked image patches predictable as a function of the caption. The approach establishes a vision-language pretraining paradigm that operates without contrastive learning and improves both downstream performance and training stability. Notably, it outperforms existing contrastive methods on tasks requiring fine-grained visual understanding and reasoning, and it shows promising scaling properties.
📝 Abstract
Image-based Joint-Embedding Predictive Architecture (I-JEPA) offers a promising approach to visual self-supervised learning through masked feature prediction. However, because of the inherent visual uncertainty at masked positions, feature prediction remains challenging and may fail to learn semantic representations. In this work, we propose Text-Conditional JEPA (TC-JEPA), which uses image captions to reduce the prediction uncertainty. Specifically, we modulate the predicted patch features using a fine-grained text conditioner that computes sparse cross-attention over the input text tokens. With such conditioning, patch features become predictable as a function of text and are thus more semantically meaningful. We show that TC-JEPA improves downstream performance and training stability, with promising scaling properties. TC-JEPA also offers a new vision-language pretraining paradigm based on feature prediction only, outperforming contrastive methods on diverse tasks, especially those requiring fine-grained visual understanding and reasoning.
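The conditioning step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how sparsity is obtained or how the text context modulates the prediction, so the top-k selection, the additive residual update, the absence of learned projections, and the names `sparse_cross_attention` and `condition_on_text` are all assumptions made for illustration.

```python
import numpy as np

def sparse_cross_attention(patches, text, k=2):
    """Attend from each patch to its k best-matching text tokens.

    patches: (P, d) predicted patch features (queries).
    text:    (T, d) text token features (keys and values).
    Top-k sparsification is an assumption, not the paper's stated method.
    """
    d = patches.shape[-1]
    scores = patches @ text.T / np.sqrt(d)            # (P, T) similarity
    k = min(k, text.shape[0])
    # k-th largest score in each row serves as the keep threshold.
    kth = np.partition(scores, -k, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    # Numerically stable softmax over the surviving tokens only.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ text, weights                    # (P, d), (P, T)

def condition_on_text(predicted, text, k=2):
    """Modulate predicted patch features with text context.

    The additive residual form is a hypothetical choice for this sketch.
    """
    context, weights = sparse_cross_attention(predicted, text, k)
    return predicted + context, weights

rng = np.random.default_rng(0)
predicted = rng.normal(size=(4, 8))   # 4 masked patches, feature dim 8
text = rng.normal(size=(6, 8))        # 6 caption tokens, feature dim 8
conditioned, weights = condition_on_text(predicted, text, k=2)
```

In this sketch each masked patch draws on only two caption tokens, which is one way to read "fine-grained": a patch is tied to the few words most relevant to it rather than to a pooled caption embedding.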