UGround: Towards Unified Visual Grounding with Unrolled Transformers

📅 2025-10-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current visual grounding methods suffer from two key limitations: (1) reliance on features taken exclusively from the final Transformer layer, which lets errors accumulate from layer to layer without correction; and (2) insufficient explicit spatial constraints on text embeddings when <SEG> is used as a prompt. To address these, we propose UGround, the first framework to "unroll" the Transformer so that intermediate layers can be selected dynamically. A reinforcement learning policy performs this selection through stochastic skip connections, and a "mask-as-prompt" mechanism passes the similarity map between the <SEG> token and the image tokens to SAM as a soft logit mask, supplying explicit spatial cues that improve mask quality. UGround unifies modeling across diverse tasks, including referring expression segmentation, reasoning segmentation, single- and multi-target grounding, and false-premise (empty-target) queries, achieving state-of-the-art performance on multiple benchmarks. It is the first end-to-end unified architecture for multi-scenario visual grounding.

📝 Abstract
We present UGround, a Unified visual Grounding paradigm that dynamically selects intermediate layers across Unrolled transformers as "mask as prompt", diverging from the prevailing pipeline that leverages the fixed last hidden layer as "<SEG> as prompt". UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layer-by-layer propagation without intermediate correction, and (2) its use of <SEG> as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (e.g., coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each <SEG> token to slide across unrolled transformer layers, dynamically selecting the layer at which it connects to the vision model (e.g., SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the <SEG> token and image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we unify, for the first time, visual grounding within a single framework from an attribute perspective, spanning traditional referring expression segmentation to the newly proposed reasoning segmentation, single-target to multi-target, and positive query to false premise (empty target). All code and models are publicly available at https://github.com/rui-qian/UGround.
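The MasP step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of cosine similarity, the row-major token layout, and the `scale` factor are all assumptions made for illustration. The abstract only states that a similarity map between the <SEG> token and the image tokens serves as a soft logit mask.

```python
import math


def similarity_map(seg_vec, image_tokens, height, width):
    """Cosine similarity between one <SEG> hidden state and each image token.

    Assumption: image tokens are stored row-major, so the flat list can be
    reshaped into a height x width grid.
    """
    if len(image_tokens) != height * width:
        raise ValueError("expected height * width image tokens")

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm > 0 else 0.0

    flat = [cosine(seg_vec, tok) for tok in image_tokens]
    return [flat[r * width:(r + 1) * width] for r in range(height)]


def to_soft_logits(sim_map, scale=10.0):
    """Scale similarities into logits for use as a soft mask prompt.

    The scale factor is a hypothetical choice; positive similarity maps to a
    positive logit, negative similarity to a negative one.
    """
    return [[scale * s for s in row] for row in sim_map]
```

In this sketch, regions whose image tokens align with the <SEG> hidden state receive high logits, which is one way a mask prompt could carry the explicit spatial cue that a bare text embedding lacks.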
Problem

Research questions and friction points this paper is trying to address.

Dynamic layer selection addresses cumulative error propagation
Mask prompts provide explicit spatial cues for grounding
Unified framework handles diverse visual grounding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic layer selection via unrolled transformers
Policy-Prompted Masking with stochastic skip connections
Mask as prompt provides explicit spatial cues
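The stochastic layer selection listed above can be sketched as sampling a layer index from a categorical policy. This is a simplified illustration under stated assumptions, not the paper's algorithm: `policy_logits` (one score per unrolled layer), `sample_layer`, and the softmax parameterization are hypothetical, and the returned log-probability is included only because policy-gradient methods commonly need it.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of per-layer scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_layer(policy_logits, rng):
    """Sample one layer index and return it with its log-probability.

    Assumption: the policy is a categorical distribution over the unrolled
    layers; the sampled index marks where the <SEG> hidden state would be
    read out and passed on to the vision model.
    """
    probs = softmax(policy_logits)
    layer = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return layer, math.log(probs[layer])


# Usage: four candidate layers with equal scores, fixed seed for repeatability.
layer, log_prob = sample_layer([0.0, 0.0, 0.0, 0.0], random.Random(0))
```

Sampling rather than always taking the highest-scoring layer keeps the selection stochastic during training, so different layers can be tried for the same query.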
Rui Qian
College of Computer Science and Artificial Intelligence, Fudan University
Xin Yin
The State Key Laboratory of Blockchain and Data Security, Zhejiang University
Chuanhang Deng
College of Computer Science and Artificial Intelligence, Fudan University; BEDI Cloud
Zhiyuan Peng
School of Computer Science, Shanghai Jiao Tong University
Jian Xiong
School of Business Administration, Southwestern University of Finance and Economics
Multi-objective evolutionary optimization, Machine learning, Data Mining, Decision support systems, Project planning and schedul
Wei Zhai
College of Computer Science and Artificial Intelligence, Fudan University
Dejing Dou
College of Computer Science and Artificial Intelligence, Fudan University; BEDI Cloud