🤖 AI Summary
Weakly supervised visual grounding (VG) suffers from imprecise cross-modal alignment between text and images, and in particular struggles to distinguish fine-grained semantic differences arising from category-based and attribute-based ambiguity. To address this, the authors propose AlignCAT, a query-based semantic matching framework with two alignment stages: a coarse-grained, category-level stage that leverages category information and global context to suppress interference from category-inconsistent objects, and a fine-grained, word-level stage that matches descriptive attribute tokens against visual queries to enforce attribute consistency. Progressively filtering out misaligned visual queries in this way also improves the efficiency of the contrastive learning objective. On three standard benchmarks (RefCOCO, RefCOCO+, and RefCOCOg), AlignCAT outperforms existing weakly supervised methods on two VG tasks.
📝 Abstract
Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out misaligned visual queries and enhances contrastive learning efficiency. Extensive experiments on three VG benchmarks, namely RefCOCO, RefCOCO+, and RefCOCOg, verify the superiority of AlignCAT over existing weakly supervised methods on two VG tasks. Our code is available at: https://github.com/I2-Multimedia-Lab/AlignCAT.
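The coarse-to-fine filtering idea described above can be sketched in a few lines. This is an illustrative toy implementation, not the authors' code: it assumes cosine similarity as the matching function, a simple top-k rule for the coarse filter, and an InfoNCE-style loss for the contrastive objective; all function names and hyperparameters here are hypothetical.

```python
# Hypothetical sketch of coarse-to-fine query filtering with a contrastive
# objective, in the spirit of AlignCAT (not the paper's actual implementation).
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between each row of `a` and each row of `b`."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def coarse_filter(queries, category_emb, keep_ratio=0.5):
    """Coarse stage: keep the queries most similar to the category embedding,
    discarding category-inconsistent candidates early."""
    sims = cosine_sim(queries, category_emb[None, :]).squeeze(-1)
    k = max(1, int(len(queries) * keep_ratio))
    keep = np.argsort(-sims)[:k]
    return queries[keep], keep

def fine_match(queries, token_embs):
    """Fine stage: score each surviving query by its mean similarity to the
    word-level attribute tokens; return the best query and all scores."""
    sims = cosine_sim(queries, token_embs)   # (num_queries, num_tokens)
    scores = sims.mean(axis=1)
    return int(np.argmax(scores)), scores

def info_nce(sim_row, pos_idx, tau=0.07):
    """InfoNCE contrastive loss for one expression over candidate queries."""
    logits = sim_row / tau
    logits = logits - logits.max()           # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[pos_idx])
```

Because the coarse stage shrinks the candidate pool before the fine, word-level matching and the contrastive loss are computed, fewer misaligned queries enter the softmax denominator, which is one plausible reading of the abstract's claim that filtering "enhances contrastive learning efficiency."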