🤖 AI Summary
To address the insufficient dense representation capability of large-scale vision encoders for contextual scene understanding under unlabeled-data conditions, this paper proposes DIP—an unsupervised dense contextual post-training method. DIP is the first to introduce meta-learning into unsupervised post-training, leveraging a pre-trained diffusion model to automatically generate diverse contextual pseudo-tasks without human annotations or knowledge distillation. Its single-stage dense prediction framework directly optimizes pixel-level representation learning in vision encoders. Evaluated on multiple real-world scene understanding tasks—including semantic segmentation and depth estimation—DIP consistently outperforms both baseline models and state-of-the-art unsupervised methods. Notably, it achieves stable performance gains using only a single A100 GPU for nine hours, demonstrating exceptional efficiency and generalization.
📝 Abstract
We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations. Code available here: https://github.com/sirkosophia/DIP