🤖 AI Summary
This work addresses a key limitation of existing vision-language pretraining models: relying solely on global supervision signals, they perform poorly at instance-level reasoning and localization. To overcome this, we propose InstAP, a framework built around a novel instance-aware pretraining objective. InstAP jointly optimizes global image-text alignment with fine-grained instance-level contrastive alignment, anchoring textual descriptions to their corresponding spatio-temporal regions. To support this, we construct InstVL, the first large-scale dataset with dual-granularity annotations covering both holistic scene descriptions and dense instance-level labels. Experiments show that InstAP substantially outperforms state-of-the-art methods on instance retrieval over the InstVL benchmark and achieves competitive zero-shot results on the MSR-VTT and DiDeMo video benchmarks, improving the model’s ability to understand and localize specific visual instances.
📝 Abstract
Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial-temporal regions. To support this, we present InstVL, a large-scale dataset (2 million images, 50,000 videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and also surpasses a strong VLP baseline trained on the exact same data corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global understanding: InstAP achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo. Qualitative visualizations further show that InstAP localizes textual mentions to the correct instances, while global-only models exhibit more diffuse, scene-level attention.
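The joint objective described in the abstract — global image-text alignment combined with instance-level region-phrase alignment — can be sketched as a weighted sum of two symmetric InfoNCE terms. The sketch below is illustrative only: the function names, the temperature, and the balancing weight `lam` are assumptions, not the paper's actual formulation.

```python
import numpy as np

def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE over a similarity matrix.

    Rows index vision embeddings, columns index text embeddings;
    matched pairs lie on the diagonal.
    """
    logits = sim / temperature

    def xent(l):
        # Numerically stable cross-entropy with targets on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # Average the vision-to-text and text-to-vision directions.
    return 0.5 * (xent(logits) + xent(logits.T))

def joint_instance_aware_loss(global_img, global_txt,
                              inst_regions, inst_phrases, lam=1.0):
    """Hypothetical joint objective: global image-text contrastive loss
    plus an instance-level region-phrase contrastive loss, balanced by
    `lam`. All embeddings are assumed L2-normalized."""
    l_global = info_nce(global_img @ global_txt.T)
    l_inst = info_nce(inst_regions @ inst_phrases.T)
    return l_global + lam * l_inst
```

With perfectly aligned (identical) embedding pairs the diagonal similarities dominate and the loss approaches zero, while randomly paired embeddings yield a loss near log of the batch size — the gradient therefore pulls each phrase toward its grounded region, which is the intuition behind the instance-aware term.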