AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

📅 2026-04-20

🤖 AI Summary
Existing reasoning segmentation methods rely on a single segmentation token, so they struggle to map complex, implicit textual queries to pixel-level masks and cannot explicitly decouple semantic reasoning from spatial localization. This work proposes a structured conditional generation framework that builds an ordered query sequence from language grounded query banks, modeling intermediate semantic states (latent reasoning tokens) separately from an explicit spatial anchor (a segmentation anchor token). A Token-Mask Cycle Consistency (TMCC) training objective aligns token-level predictions with pixel-level supervision across resolutions. By explicitly disentangling semantic reasoning from spatial localization, the approach reports state-of-the-art results on the ReasonSeg test set, with a gIoU of 67.7% and a cIoU of 68.1%.

📝 Abstract
Reasoning segmentation requires models to ground complex, implicit textual queries into precise pixel-level masks. Existing approaches rely on a single segmentation token <SEG>, whose hidden state implicitly encodes both semantic reasoning and spatial localization, limiting the model's ability to explicitly disentangle what to segment from where to segment. We introduce AnchorSeg, which reformulates reasoning segmentation as a structured conditional generation process over image tokens, conditioned on language grounded query banks. Instead of compressing all semantic reasoning and spatial localization into a single embedding, AnchorSeg constructs an ordered sequence of query banks: latent reasoning tokens that capture intermediate semantic states, and a segmentation anchor token that provides explicit spatial grounding. We model spatial conditioning as a factorized distribution over image tokens, where the anchor query determines localization signals while contextual queries provide semantic modulation. To bridge token-level predictions and pixel-level supervision, we propose Token-Mask Cycle Consistency (TMCC), a bidirectional training objective that enforces alignment across resolutions. By explicitly decoupling spatial grounding from semantic reasoning through structured language grounded query banks, AnchorSeg achieves state-of-the-art results on the ReasonSeg test set (67.7% gIoU and 68.1% cIoU). All code and models are publicly available at https://github.com/rui-qian/AnchorSeg.
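
The abstract reports gIoU and cIoU without defining them. As a minimal sketch, assuming the definitions commonly used for ReasonSeg evaluation (gIoU as the mean of per-image IoUs, cIoU as cumulative intersection over cumulative union), the two metrics could be computed as below. The function names, the list-of-lists mask format, and the handling of empty masks are illustrative assumptions, not taken from the AnchorSeg code.

```python
from typing import Sequence, Tuple

# Assumption: a mask is a 2D grid of 0/1 values given as nested sequences.
Mask = Sequence[Sequence[int]]


def intersection_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count foreground pixels in the intersection and union of two same-sized masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def giou_ciou(preds: Sequence[Mask], targets: Sequence[Mask]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) over a non-empty set of predicted and ground-truth masks."""
    per_image_ious = []
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_union(pred, target)
        # Assumption: an empty prediction for an empty target counts as IoU 1.0.
        per_image_ious.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image_ious) / len(per_image_ious)        # mean of per-image IoUs
    ciou = total_inter / total_union if total_union else 1.0  # pooled over all pixels
    return giou, ciou


# Two toy 2x2 examples: the first prediction overshoots, the second is exact.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
giou, ciou = giou_ciou(preds, targets)
```

Because cIoU pools pixels across the whole set, it is weighted toward images with large objects, while gIoU treats every image equally. That difference is why the two figures are reported side by side.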
Problem

Research questions and friction points this paper is trying to address.

reasoning segmentation
language grounding
semantic reasoning
spatial localization
pixel-level segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning segmentation
language grounded query banks
spatial grounding
semantic reasoning
Token-Mask Cycle Consistency
Rui Qian
College of Computer Science and Artificial Intelligence, Fudan University
Chuanhang Deng
College of Computer Science and Artificial Intelligence, Fudan University; BEDI Cloud
Qiang Huang
Senior Lecturer, School of Computer Science, University of Sunderland
speech processing, audio signal analysis, natural language processing, multimodal information processing
Jian Xiong
School of Business Administration, Southwestern University of Finance and Economics
Multi-objective evolutionary optimization, Machine learning, Data Mining, Decision support systems, Project planning and schedul
Mingxuan Li
College of Computer Science and Artificial Intelligence, Fudan University
Yingbo Zhou
Senior Research Director, Salesforce Research
Deep Learning, Machine Learning, Large Language Modeling, Representation Learning, Multimodal
Wei Zhai
College of Computer Science and Artificial Intelligence, Fudan University
Jintao Chen
College of Computer Science and Artificial Intelligence, Fudan University
Dejing Dou
College of Computer Science and Artificial Intelligence, Fudan University; BEDI Cloud