BARE: Towards Bias-Aware and Reasoning-Enhanced One-Tower Visual Grounding

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing single-tower visual grounding methods, which suffer from modality bias due to excessive entanglement of multimodal representations and insufficient semantic reasoning capabilities, leading to inaccurate interpretation of referring expressions. To overcome these challenges, the authors propose the BARE framework, which integrates three synergistic modules—linguistic salience modulation, visual bias correction, and referential relation enhancement—to strengthen referential semantic understanding while preserving modality-specific characteristics. Built upon a single-tower multimodal architecture, BARE achieves state-of-the-art performance across five benchmark datasets, demonstrating not only superior accuracy and robustness in visual grounding but also improved computational efficiency.

📝 Abstract
Visual Grounding (VG), which aims to locate a specific region referred to by an expression, is a fundamental yet challenging task in multimodal understanding. While recent grounding transfer works have advanced the field through one-tower architectures, they still suffer from two primary limitations: (1) over-entangled multimodal representations that exacerbate deceptive modality biases, and (2) insufficient semantic reasoning that hinders the comprehension of referential cues. In this paper, we propose BARE, a bias-aware and reasoning-enhanced framework for one-tower visual grounding. BARE introduces a mechanism that preserves modality-specific features and constructs referential semantics through three novel modules: (i) language salience modulation, (ii) visual bias correction, and (iii) referential relationship enhancement, which jointly mitigate multimodal distractions and enhance referential comprehension. Extensive experimental results on five benchmarks demonstrate that BARE not only achieves state-of-the-art performance but also delivers superior computational efficiency compared to existing approaches. The code is publicly accessible at https://github.com/Marloweeee/BARE.
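The abstract names three modules but does not specify their internals. The toy sketch below is only an illustration of how such a pipeline could be wired, not the authors' method: the salience projection, the mean-subtraction bias correction, and the single attention step are all assumptions standing in for the real learned modules.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy fused token stream for a one-tower model:
# L language tokens and V visual patch tokens, dimension D.
L, V, D = 4, 9, 16
lang = rng.standard_normal((L, D))
vis = rng.standard_normal((V, D))

# (i) Language salience modulation: reweight language tokens by a
# salience score (a random projection stands in for the learned one).
w_sal = rng.standard_normal(D)
salience = softmax(lang @ w_sal)             # (L,)
lang_mod = lang * (1.0 + salience[:, None])  # emphasize salient words

# (ii) Visual bias correction: remove a dataset-level visual prior,
# approximated here by the mean visual feature.
vis_corr = vis - vis.mean(axis=0, keepdims=True)

# (iii) Referential relation enhancement: cross-token attention that
# lets language tokens aggregate related visual evidence.
attn = softmax(lang_mod @ vis_corr.T / np.sqrt(D), axis=-1)  # (L, V)
referent = attn @ vis_corr                                   # (L, D)

# A box-prediction head would consume the fused tokens; here we just
# pick the visual patch most attended by the expression overall.
patch_idx = int(attn.sum(axis=0).argmax())
print(referent.shape, patch_idx)
```

The key property this sketch preserves is that the bias correction and salience reweighting act on each modality separately before any cross-modal mixing, matching the paper's stated goal of keeping modality-specific features intact.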
Problem

Research questions and friction points this paper is trying to address.

visual grounding · modality bias · semantic reasoning · one-tower architecture · referential comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

bias-aware · reasoning-enhanced · one-tower architecture · visual grounding · modality-specific features
Hongbing Li
School of Artificial Intelligence, Beijing University of Posts and Telecommunications (BUPT), Beijing 100876, China

Linhui Xiao
Pengcheng Laboratory, Shenzhen 518066, China, and also with Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190, China

Zihan Zhao
Shanghai Jiao Tong University
NLP

Qi Shen
Active Materials and Smart Living Laboratory, University of Nevada, Las Vegas
Soft robotics · Smart Materials · Bioinspiration · Physical modeling · Actuators/Sensors

Yixiang Huang
Beijing University of Posts and Telecommunications
Deep Learning · Computer Vision · Multimodal Learning

Bo Xiao
School of Artificial Intelligence, Beijing University of Posts and Telecommunications (BUPT), Beijing 100876, China

Zhanyu Ma
Beijing University of Posts and Telecommunications
Pattern Recognition · Machine Learning · Computer Vision · Multimedia Technology · Deep Learning