🤖 AI Summary
Existing visual representation learning methods struggle to simultaneously achieve global semantic alignment and local spatial precision. To address this challenge, this work proposes MTV, a multi-task vision pretraining framework that systematically integrates vision-language contrastive learning, self-supervised learning, and dense spatial supervision within a shared backbone architecture. The framework leverages foundation models—including CLIP, MAE, and DINO—together with Depth Anything V2 and OWLv2 to generate high-quality dense pseudo-labels, thereby eliminating the need for manual annotation. By jointly optimizing these complementary objectives, MTV reveals both synergistic and interfering interactions among them, and significantly enhances fine-grained spatial reasoning while preserving robust global semantic understanding. This approach enables scalable, general-purpose visual encoders that achieve a balanced “best-of-both-worlds” representation.
📝 Abstract
Current visual representation learning remains bifurcated: vision-language models (e.g., CLIP) excel at global semantic alignment but lack spatial precision, while self-supervised methods (e.g., MAE, DINO) capture intricate local structures yet struggle with high-level semantic context. We argue that these paradigms are fundamentally complementary and can be integrated into a principled multi-task framework, further enhanced by dense spatial supervision. We introduce MTV, a multi-task visual pretraining framework that jointly optimizes a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives. To mitigate the need for manual annotations, we leverage high-capacity "expert" models -- such as Depth Anything V2 and OWLv2 -- to synthesize dense, structured pseudo-labels at scale. Beyond the framework, we provide a systematic investigation into the mechanics of multi-task visual learning, analyzing: (i) the marginal gain of each objective, (ii) task synergies versus interference, and (iii) scaling behavior across varying data and model scales. Our results demonstrate that MTV achieves "best-of-both-worlds" performance, significantly enhancing fine-grained spatial reasoning without compromising global semantic understanding. Our findings suggest that multi-task learning, fueled by high-quality pseudo-supervision, is a scalable path toward more general visual encoders.
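To make the joint objective concrete, the sketch below illustrates what "jointly optimizing a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives" typically looks like in code: a CLIP-style symmetric contrastive loss plus a weighted sum over the three task losses. All function names, weights, and the temperature value here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def clip_style_contrastive_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE over L2-normalized image/text embeddings.
    Matched image-text pairs share the same row index, so the targets
    lie on the diagonal of the similarity matrix. (Illustrative sketch;
    the paper's exact contrastive formulation may differ.)"""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) cosine similarities

    def cross_entropy_on_diagonal(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

def mtv_total_loss(l_vl, l_ssl, l_dense, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task objectives backpropagated through
    the shared backbone. The weights are hypothetical hyperparameters."""
    w_vl, w_ssl, w_dense = weights
    return w_vl * l_vl + w_ssl * l_ssl + w_dense * l_dense

# Usage: perfectly aligned pairs incur a much lower contrastive loss
# than mismatched ones.
aligned = clip_style_contrastive_loss(np.eye(4), np.eye(4))
mismatched = clip_style_contrastive_loss(np.eye(4), np.eye(4)[::-1])
total = mtv_total_loss(aligned, l_ssl=0.5, l_dense=0.2)
```

In practice the self-supervised term would be a masked-reconstruction loss (MAE-style) and the dense term a regression/detection loss against the pseudo-labels from Depth Anything V2 and OWLv2; both are left as scalars here since the abstract does not specify their exact form.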