Towards Visual Query Segmentation in the Wild

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a novel paradigm, Visual Query Segmentation (VQS), to address a key limitation of existing visual query localization methods: they typically localize only the last occurrence of a target with a bounding box, which is insufficient for fine-grained understanding. VQS instead performs pixel-level segmentation of all instances of a target object in untrimmed videos, given an external visual query. To facilitate research in this direction, the authors introduce VQS-4K, the first large-scale benchmark for this task, and present VQ-SAM, a method built on the SAM 2 architecture. VQ-SAM incorporates an Adaptive Memory Generation (AMG) module that fuses target-specific cues through multi-stage memory evolution while suppressing background distractors. Experiments show that VQ-SAM significantly outperforms existing approaches on VQS-4K, validating the proposed paradigm and enabling more comprehensive and precise visual query localization for real-world applications.
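The summary names the Adaptive Memory Generation (AMG) module only at a high level. As a rough illustration, here is a minimal, hypothetical PyTorch sketch of what such a memory-evolution step could look like; the paper's actual architecture, dimensions, and fusion rules are not given here, so every class name, layer, and tensor shape below is an assumption, not the authors' implementation.

```python
# Hypothetical sketch of an AMG-style memory update: fuse target-specific
# cues into the frame features while down-weighting background distractors.
# All names, shapes, and layer choices are illustrative assumptions.
import torch
import torch.nn as nn


class AdaptiveMemoryGeneration(nn.Module):
    """Assumed design: cross-attend to the query target, then gate the
    attended features against background-distractor cues before writing
    the result into the memory bank."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Current-frame features attend to target cues from the visual query.
        self.target_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Learned gate that suppresses features resembling distractors.
        self.distractor_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats, target_feats, distractor_feats):
        # frame_feats:      (B, N, C) features of the current frame
        # target_feats:     (B, T, C) target-specific cues from the visual query
        # distractor_feats: (B, N, C) background-distractor cues from the video
        attended, _ = self.target_attn(frame_feats, target_feats, target_feats)
        gate = self.distractor_gate(torch.cat([attended, distractor_feats], dim=-1))
        # Gated residual update yields the evolved memory entry for this frame.
        return self.norm(frame_feats + gate * attended)
```

Stacking several such updates across stages would give the "multi-stage memory evolution" the summary describes, though the actual staging scheme is unspecified here.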

📝 Abstract
In this paper, we introduce visual query segmentation (VQS), a new paradigm of visual query localization (VQL) that aims to segment all pixel-level occurrences of an object of interest in an untrimmed video, given an external visual query. Compared to existing VQL, which locates only the last appearance of a target using bounding boxes, VQS enables more comprehensive (i.e., all object occurrences) and precise (i.e., pixel-level masks) localization, making it more practical for real-world scenarios. To foster research on this task, we present VQS-4K, a large-scale benchmark dedicated to VQS. Specifically, VQS-4K contains 4,111 videos with more than 1.3 million frames and covers a diverse set of 222 object categories. Each video is paired with a visual query, defined by a frame outside the search video together with its target mask, and annotated with spatio-temporal masklets corresponding to the queried target. To ensure high quality, all videos in VQS-4K are manually labeled with meticulous inspection and iterative refinement. To the best of our knowledge, VQS-4K is the first benchmark specifically designed for VQS. Furthermore, to stimulate future research, we present a simple yet effective method, named VQ-SAM, which extends SAM 2 with an adaptive memory generation (AMG) module that leverages target-specific and background-distractor cues from the video to progressively evolve the memory through a novel multi-stage framework, significantly improving performance. In extensive experiments on VQS-4K, VQ-SAM achieves promising results and surpasses all existing approaches, demonstrating its effectiveness. With the proposed VQS-4K and VQ-SAM, we hope to go beyond the current VQL paradigm and inspire future research and practical applications on VQS. Our benchmark, code, and results will be made publicly available.
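The abstract specifies the task's inputs and outputs (a query frame with a target mask, and spatio-temporal masklets as ground truth) but not the scoring protocol. The sketch below shows one plausible way predictions might be compared against VQS-4K masklets, using plain per-frame mask IoU averaged over the video; this metric choice is an illustrative assumption, not the benchmark's official protocol.

```python
# Hedged sketch: score a predicted masklet against a ground-truth masklet
# with mean per-frame mask IoU. The benchmark's real metric may differ.
import numpy as np


def video_mask_iou(pred_masks: np.ndarray, gt_masks: np.ndarray) -> float:
    """pred_masks, gt_masks: (T, H, W) boolean arrays, one mask per frame.
    Frames where the target is absent have all-False ground truth."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        union = np.logical_or(pred, gt).sum()
        if union == 0:
            # Target absent and nothing predicted: count as perfect agreement.
            ious.append(1.0)
        else:
            ious.append(np.logical_and(pred, gt).sum() / union)
    return float(np.mean(ious))
```

Handling target-absent frames explicitly matters here: in untrimmed videos the object may appear only briefly, so a method must be rewarded for correctly predicting empty masks elsewhere.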
Problem

Research questions and friction points this paper is trying to address.

visual query segmentation
video object segmentation
pixel-level localization
untrimmed video
visual query
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Query Segmentation
VQS-4K
Pixel-level Localization
Adaptive Memory Generation
SAM 2
Bing Fan
Department of Computer Science and Engineering, University of North Texas
Minghao Li
Beihang University
Natural Language Processing
Hanzhi Zhang
Department of Computer Science and Engineering, University of North Texas
Shaohua Dong
University of North Texas
Computer Vision
Naga Prudhvi Mareedu
Department of Computer Science and Engineering, University of North Texas
Weishi Shi
University of North Texas
Data Mining, Machine Learning, Active Learning
Yunhe Feng
Assistant Professor at University of North Texas
Responsible AI, Efficient Generative AI, Data Security and Privacy, Applied AI
Yan Huang
Department of Computer Science and Engineering, University of North Texas
Heng Fan
Assistant Professor, University of North Texas
Computer Vision, Machine Learning, Artificial Intelligence