🤖 AI Summary
This work addresses the challenge of jointly modeling fine-grained object localization and semantic description in video understanding. We propose VoCap, a unified multimodal prompting framework that performs video object segmentation and object-centric captioning within a single model, driven by text, bounding-box, or mask prompts. Since manually annotating object captions at scale is expensive, the method uses a large vision-language model to generate pseudo captions for an existing large-scale segmentation dataset, then jointly optimizes spatio-temporal mask prediction and caption generation conditioned on the multimodal prompts. VoCap achieves state-of-the-art performance on referring-expression video object segmentation and is competitive on semi-supervised video object segmentation. Additionally, we introduce SAV-Caption, a benchmark for video object captioning comprising videos with object-level masks and captions, with manually annotated captions on the validation set. Together, the model, the large-scale pseudo-labeled training corpus, and the evaluation benchmark provide foundational resources for promptable video understanding.
📝 Abstract
Understanding objects in videos in terms of fine-grained localization masks and detailed semantic properties is a fundamental task in video understanding. In this paper, we propose VoCap, a flexible video model that consumes a video and a prompt of various modalities (text, box, or mask), and produces a spatio-temporal masklet with a corresponding object-centric caption. As such, our model simultaneously addresses the tasks of promptable video object segmentation, referring expression segmentation, and object captioning. Since obtaining data for this task is tedious and expensive, we propose to annotate an existing large-scale segmentation dataset (SAV) with pseudo object captions. We do so by preprocessing videos with their ground-truth masks to highlight the object of interest and feeding this to a large Vision Language Model (VLM). For an unbiased evaluation, we collect manual annotations on the validation set. We call the resulting dataset SAV-Caption. We train our VoCap model at scale on SAV-Caption together with a mix of other image and video datasets. Our model yields state-of-the-art results on referring expression video object segmentation, is competitive on semi-supervised video object segmentation, and establishes a benchmark for video object captioning. Our dataset will be made available at https://github.com/google-deepmind/vocap.
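The abstract describes the pseudo-captioning pipeline only at a high level: ground-truth masks are used to highlight the object of interest before the frames are passed to a VLM. The following is a minimal sketch of that idea under our own assumptions; the overlay style, prompt wording, and VLM interface are not specified in the paper, and `caption_with_vlm` is a hypothetical placeholder, not an API from this work.

```python
# Sketch of mask-guided pseudo captioning (assumed details, not the paper's exact recipe):
# dim the background around the ground-truth mask so the target object stands out,
# then ask a vision-language model to describe the highlighted object.

import numpy as np


def highlight_object(frame: np.ndarray, mask: np.ndarray,
                     dim_factor: float = 0.4) -> np.ndarray:
    """Darken pixels outside the mask so the object of interest is visually emphasized.

    frame: (H, W, 3) uint8 RGB frame.
    mask:  (H, W) boolean array, True on the object of interest.
    """
    out = frame.astype(np.float32)
    out[~mask] *= dim_factor  # dim everything outside the ground-truth mask
    return out.clip(0, 255).astype(np.uint8)


def caption_with_vlm(frames: list[np.ndarray]) -> str:
    """Hypothetical VLM call returning an object-centric caption for the
    highlighted frames; replace with an actual vision-language model."""
    raise NotImplementedError("plug in a vision-language model here")


def pseudo_caption(video: list[np.ndarray], masks: list[np.ndarray]) -> str:
    """Generate one pseudo caption for a single object track (masklet)."""
    highlighted = [highlight_object(f, m) for f, m in zip(video, masks)]
    return caption_with_vlm(highlighted)
```

Applied over every masklet in SAV, this kind of procedure would yield the object-level pseudo captions used for training, while the manually annotated validation captions keep the evaluation unbiased.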