Towards Visual Query Segmentation in the Wild

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a novel paradigm, Visual Query Segmentation (VQS), to address a key limitation of existing visual query localization methods: they typically localize only the last occurrence of a target with a bounding box, which is insufficient for fine-grained understanding. VQS instead performs pixel-level segmentation of all instances of a target object in untrimmed videos, given an external visual query. To facilitate research in this direction, the authors introduce VQS-4K, the first large-scale benchmark for this task, and present VQ-SAM, a method built on the SAM 2 architecture. VQ-SAM incorporates an Adaptive Memory Generation (AMG) module that fuses target-specific cues through multi-stage memory evolution while suppressing background distractors. Experiments show that VQ-SAM significantly outperforms existing approaches on VQS-4K, validating the proposed paradigm and enabling more comprehensive and precise visual query localization for real-world applications.
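The summary names the Adaptive Memory Generation (AMG) module only at a high level. As a rough illustration, here is a minimal, hypothetical PyTorch sketch of what such a memory-evolution step could look like; the paper's actual architecture, dimensions, and fusion rules are not given here, so every class name, layer, and tensor shape below is an assumption, not the authors' implementation.

```python
# Hypothetical sketch of an AMG-style memory update: fuse target-specific
# cues into the frame features while down-weighting background distractors.
# All names, shapes, and layer choices are illustrative assumptions.
import torch
import torch.nn as nn


class AdaptiveMemoryGeneration(nn.Module):
    """Assumed design: cross-attend to the query target, then gate the
    attended features against background-distractor cues before writing
    the result into the memory bank."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Current-frame features attend to target cues from the visual query.
        self.target_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Learned gate that suppresses features resembling distractors.
        self.distractor_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats, target_feats, distractor_feats):
        # frame_feats:      (B, N, C) features of the current frame
        # target_feats:     (B, T, C) target-specific cues from the visual query
        # distractor_feats: (B, N, C) background-distractor cues from the video
        attended, _ = self.target_attn(frame_feats, target_feats, target_feats)
        gate = self.distractor_gate(torch.cat([attended, distractor_feats], dim=-1))
        # Gated residual update yields the evolved memory entry for this frame.
        return self.norm(frame_feats + gate * attended)
```

Stacking several such updates across stages would give the "multi-stage memory evolution" the summary describes, though the actual staging scheme is unspecified here.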

📝 Abstract
In this paper, we introduce visual query segmentation (VQS), a new paradigm of visual query localization (VQL) that aims to segment all pixel-level occurrences of an object of interest in an untrimmed video, given an external visual query. Compared to existing VQL, which locates only the last appearance of a target using bounding boxes, VQS enables more comprehensive (i.e., all object occurrences) and precise (i.e., pixel-level masks) localization, making it more practical for real-world scenarios. To foster research on this task, we present VQS-4K, a large-scale benchmark dedicated to VQS. Specifically, VQS-4K contains 4,111 videos with more than 1.3 million frames and covers a diverse set of 222 object categories. Each video is paired with a visual query, defined by a frame outside the search video together with its target mask, and annotated with spatio-temporal masklets corresponding to the queried target. To ensure high quality, all videos in VQS-4K are manually labeled with meticulous inspection and iterative refinement. To the best of our knowledge, VQS-4K is the first benchmark specifically designed for VQS. Furthermore, to stimulate future research, we present a simple yet effective method, named VQ-SAM, which extends SAM 2 with an adaptive memory generation (AMG) module that leverages target-specific and background-distractor cues from the video to progressively evolve the memory through a novel multi-stage framework, significantly improving performance. In extensive experiments on VQS-4K, VQ-SAM achieves promising results and surpasses all existing approaches, demonstrating its effectiveness. With the proposed VQS-4K and VQ-SAM, we hope to go beyond the current VQL paradigm and inspire future research and practical applications on VQS. Our benchmark, code, and results will be made publicly available.
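The abstract specifies the task's inputs and outputs (a query frame with a target mask, and spatio-temporal masklets as ground truth) but not the scoring protocol. The sketch below shows one plausible way predictions might be compared against VQS-4K masklets, using plain per-frame mask IoU averaged over the video; this metric choice is an illustrative assumption, not the benchmark's official protocol.

```python
# Hedged sketch: score a predicted masklet against a ground-truth masklet
# with mean per-frame mask IoU. The benchmark's real metric may differ.
import numpy as np


def video_mask_iou(pred_masks: np.ndarray, gt_masks: np.ndarray) -> float:
    """pred_masks, gt_masks: (T, H, W) boolean arrays, one mask per frame.
    Frames where the target is absent have all-False ground truth."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        union = np.logical_or(pred, gt).sum()
        if union == 0:
            # Target absent and nothing predicted: count as perfect agreement.
            ious.append(1.0)
        else:
            ious.append(np.logical_and(pred, gt).sum() / union)
    return float(np.mean(ious))
```

Handling target-absent frames explicitly matters here: in untrimmed videos the object may appear only briefly, so a method must be rewarded for correctly predicting empty masks elsewhere.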
Problem

Research questions and friction points this paper is trying to address.

visual query segmentation
video object segmentation
pixel-level localization
untrimmed video
visual query
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Query Segmentation
VQS-4K
Pixel-level Localization
Adaptive Memory Generation
SAM 2
Bing Fan
Department of Computer Science and Engineering, University of North Texas
Minghao Li
Beihang University
Natural Language Processing
Hanzhi Zhang
Department of Computer Science and Engineering, University of North Texas
Shaohua Dong
University of North Texas
Computer Vision
Naga Prudhvi Mareedu
Department of Computer Science and Engineering, University of North Texas
Weishi Shi
University of North Texas
Data Mining, Machine Learning, Active Learning
Yunhe Feng
Assistant Professor at University of North Texas
Responsible AI, Efficient Generative AI, Data Security and Privacy, Applied AI
Yan Huang
Department of Computer Science and Engineering, University of North Texas
Heng Fan
Assistant Professor, University of North Texas
Computer Vision, Machine Learning, Artificial Intelligence