AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing reasoning segmentation methods rely on a single segmentation token whose hidden state must encode both semantic reasoning and spatial localization, which makes it hard to map complex, implicit textual queries to pixel-level masks. This work reformulates the task as structured conditional generation: a language-grounded query bank builds an ordered query sequence in which latent reasoning tokens model intermediate semantic states and a segmentation anchor token provides explicit spatial grounding. A Token–Mask Cycle Consistency (TMCC) training objective additionally aligns token-level predictions with pixel-level supervision across resolutions. By explicitly decoupling semantic reasoning from spatial localization, the approach reports state-of-the-art results on the ReasonSeg test set, with 67.7% gIoU and 68.1% cIoU.
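
To make the bidirectional, cross-resolution alignment idea behind TMCC more concrete, here is a minimal sketch in plain Python. It is not the paper's objective: the summary does not give the loss form, so the choice of a token-level cross-entropy against an average-pooled mask, a pixel-level soft Dice term on nearest-neighbour upsampled token probabilities, and all function names (`pool_mask_to_tokens`, `upsample_tokens_to_pixels`, `cycle_consistency_loss`) are assumptions made purely for illustration.

```python
import math


def pool_mask_to_tokens(pixel_mask, patch):
    """Mask -> token direction: average-pool an HxW binary mask to the token grid."""
    h, w = len(pixel_mask), len(pixel_mask[0])
    return [
        [
            sum(pixel_mask[i + di][j + dj] for di in range(patch) for dj in range(patch))
            / (patch * patch)
            for j in range(0, w, patch)
        ]
        for i in range(0, h, patch)
    ]


def upsample_tokens_to_pixels(token_probs, patch):
    """Token -> mask direction: nearest-neighbour upsampling to pixel resolution."""
    rows, cols = len(token_probs), len(token_probs[0])
    return [
        [token_probs[i // patch][j // patch] for j in range(cols * patch)]
        for i in range(rows * patch)
    ]


def mean_bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two equally shaped 2D grids."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def soft_dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss (1 - Dice coefficient) between two equally shaped 2D grids."""
    inter = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 - (2.0 * inter + eps) / (total + eps)


def cycle_consistency_loss(token_probs, pixel_mask, patch):
    """Hypothetical two-way alignment loss (assumed form, not taken from the paper)."""
    token_term = mean_bce(token_probs, pool_mask_to_tokens(pixel_mask, patch))
    pixel_term = soft_dice_loss(upsample_tokens_to_pixels(token_probs, patch), pixel_mask)
    return token_term + pixel_term


# Toy example: a 4x4 mask whose top-left 2x2 block is foreground, 2x2 token grid.
pixel_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
aligned = [[0.9, 0.1], [0.1, 0.1]]     # token predictions that agree with the mask
misaligned = [[0.1, 0.9], [0.9, 0.9]]  # token predictions that contradict it

loss_aligned = cycle_consistency_loss(aligned, pixel_mask, patch=2)
loss_misaligned = cycle_consistency_loss(misaligned, pixel_mask, patch=2)
```

In this toy setting the aligned token predictions receive a lower loss than the misaligned ones in both directions, which is the behaviour one would expect from any objective that ties token-level and pixel-level predictions together.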

📝 Abstract

Reasoning segmentation requires models to ground complex, implicit textual queries into precise pixel-level masks. Existing approaches rely on a single segmentation token $\texttt{<SEG>}$, whose hidden state implicitly encodes both semantic reasoning and spatial localization, limiting the model's ability to explicitly disentangle what to segment from where to segment. We introduce AnchorSeg, which reformulates reasoning segmentation as a structured conditional generation process over image tokens, conditioned on language grounded query banks. Instead of compressing all semantic reasoning and spatial localization into a single embedding, AnchorSeg constructs an ordered sequence of query banks: latent reasoning tokens that capture intermediate semantic states, and a segmentation anchor token that provides explicit spatial grounding. We model spatial conditioning as a factorized distribution over image tokens, where the anchor query determines localization signals while contextual queries provide semantic modulation. To bridge token-level predictions and pixel-level supervision, we propose Token--Mask Cycle Consistency (TMCC), a bidirectional training objective that enforces alignment across resolutions. By explicitly decoupling spatial grounding from semantic reasoning through structured language grounded query banks, AnchorSeg achieves state-of-the-art results on the ReasonSeg test set (67.7\% gIoU and 68.1\% cIoU). All code and models are publicly available at https://github.com/rui-qian/AnchorSeg.
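
The factorized spatial conditioning described in the abstract can be pictured with a small plain-Python sketch. The abstract does not specify the parameterization, so everything here is an assumption for illustration only: each image token gets an independent foreground probability, the anchor query contributes a localization logit through a dot product, the contextual queries contribute an averaged additive modulation, and the names `token_foreground_probs` and `mask_log_likelihood` are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def token_foreground_probs(image_tokens, anchor_query, context_queries):
    """Per-token foreground probability under an assumed factorized model."""
    probs = []
    for x in image_tokens:
        # Localization signal: how strongly the anchor query responds to this token.
        loc_logit = dot(anchor_query, x)
        # Semantic modulation: mean response of the contextual queries (assumption).
        if context_queries:
            mod_logit = sum(dot(c, x) for c in context_queries) / len(context_queries)
        else:
            mod_logit = 0.0
        probs.append(sigmoid(loc_logit + mod_logit))
    return probs


def mask_log_likelihood(probs, token_mask, eps=1e-12):
    """Log-likelihood of a binary token mask when tokens are treated as independent."""
    return sum(
        math.log(p + eps) if m else math.log(1.0 - p + eps)
        for p, m in zip(probs, token_mask)
    )


# Toy example: four 2-dimensional image tokens, one anchor, two contextual queries.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
anchor_query = [2.0, 0.0]
context_queries = [[0.5, 0.5], [0.5, -0.5]]

probs = token_foreground_probs(image_tokens, anchor_query, context_queries)
ll_plausible = mask_log_likelihood(probs, [1, 0, 0, 1])
ll_implausible = mask_log_likelihood(probs, [0, 1, 1, 0])
```

Because the distribution factorizes over tokens, the likelihood of a whole token mask is a product of per-token terms, so a mask that marks the tokens the anchor responds to scores higher than its complement. Keeping the anchor and contextual contributions as separate terms is one simple way to express "where" and "what" as distinct signals; the actual model may combine them differently.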

Problem

Research questions and friction points this paper is trying to address.

reasoning segmentation

language grounding

semantic reasoning

spatial localization

pixel-level segmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning segmentation

language grounded query banks

spatial grounding