Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition

📅 2024-07-17

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address key bottlenecks in Grounded Multimodal Named Entity Recognition (GMNER), namely difficulty distinguishing ambiguous entities, coarse-grained cross-modal alignment, and insufficient modeling of global relationships, this paper proposes the Multi-grained Query-guided Set Prediction Network (MQSPN). Methodologically: (1) GMNER is reformulated as an end-to-end set prediction task, avoiding both the one-by-one decoding order of sequence generation methods and the human-designed type queries of machine reading comprehension methods; (2) relationships are modeled at two levels: intra-entity (explicitly aligning each textual entity with its visual region) and inter-entity (establishing relationships among entities through optimal global matching); (3) a set of learnable multi-grained queries and a query-guided Fusion Net (QFNet) are used to align textual spans, entity types, and visual regions. The framework achieves state-of-the-art performance on widely used GMNER benchmarks and targets ambiguous cases such as “Jordan” (Person) versus “off-White x Jordan” (Shoes).

📝 Abstract

Grounded Multimodal Named Entity Recognition (GMNER) is an emerging information extraction (IE) task that aims to simultaneously extract entity spans, entity types, and the corresponding visual regions of entities from given sentence-image pairs. Recent unified methods based on machine reading comprehension or sequence generation frameworks show limitations in this difficult task. The former, which relies on human-designed type queries, struggles to differentiate ambiguous entities, such as Jordan (Person) and off-White x Jordan (Shoes). The latter, which follows a one-by-one decoding order, suffers from exposure bias. We maintain that these works misunderstand the relationships of multimodal entities. To tackle these issues, we propose a novel unified framework named Multi-grained Query-guided Set Prediction Network (MQSPN) to learn appropriate relationships at the intra-entity and inter-entity levels. Specifically, MQSPN explicitly aligns textual entities with visual regions by employing a set of learnable queries to strengthen intra-entity connections. Based on distinct intra-entity modeling, MQSPN reformulates GMNER as a set prediction task, guiding models to establish appropriate inter-entity relationships from an optimal global matching perspective. Additionally, we incorporate a query-guided Fusion Net (QFNet) as a glue network to improve the alignment of the two-level relationships. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on widely used benchmarks.
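
The "optimal global matching" idea behind set prediction can be illustrated with a small sketch. This is not the authors' implementation: the `Entity` tuple layout, the hard 0/1 span and type costs, the IoU-based region cost, and the function names are all assumptions made for illustration, and the exhaustive search over assignments stands in for the Hungarian algorithm that set prediction models typically use. It only shows how an unordered set of predictions can be matched to gold entities by minimizing a total cost, with no decoding order involved.

```python
from itertools import permutations
from typing import List, Optional, Sequence, Tuple

# Hypothetical representation: (token span, entity type, visual region or None).
Span = Tuple[int, int]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Entity = Tuple[Span, str, Optional[Box]]


def box_iou(a: Optional[Box], b: Optional[Box]) -> float:
    """Intersection over union; two ungroundable entities (None) count as a match."""
    if a is None or b is None:
        return 1.0 if a is None and b is None else 0.0
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pair_cost(pred: Entity, gold: Entity) -> float:
    """Assumed cost: 0/1 span mismatch + 0/1 type mismatch + (1 - region IoU)."""
    span_cost = 0.0 if pred[0] == gold[0] else 1.0
    type_cost = 0.0 if pred[1] == gold[1] else 1.0
    region_cost = 1.0 - box_iou(pred[2], gold[2])
    return span_cost + type_cost + region_cost


def match(preds: Sequence[Entity], golds: Sequence[Entity]) -> Tuple[List[Tuple[int, int]], float]:
    """Return (pred_idx, gold_idx) pairs minimizing total cost, plus that cost.

    Exhaustive search, so only suitable for tiny examples; assumes there are
    at least as many predictions (queries) as gold entities.
    """
    if len(preds) < len(golds):
        raise ValueError("need at least as many predictions as gold entities")
    best_cost, best_perm = None, ()
    for perm in permutations(range(len(preds)), len(golds)):
        cost = sum(pair_cost(preds[p], golds[g]) for g, p in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return [(p, g) for g, p in enumerate(best_perm)], float(best_cost or 0.0)


# Toy data (made up): "Jordan" as a person and "off-White x Jordan" as shoes.
golds = [((0, 1), "PER", (10.0, 10.0, 50.0, 50.0)),
         ((3, 6), "OTHER", (60.0, 20.0, 120.0, 80.0))]
preds = [((3, 6), "OTHER", (60.0, 20.0, 120.0, 80.0)),
         ((5, 6), "PER", None),
         ((0, 1), "PER", (10.0, 10.0, 50.0, 50.0))]
assignment, total = match(preds, golds)
print(assignment, total)
```

Because the matching is computed over the whole set at once, the order in which predictions are produced does not matter; unmatched predictions (here the spurious `((5, 6), "PER", None)`) would be treated as "no entity" during training in a typical set prediction setup.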

Problem

Research questions and friction points this paper is trying to address.

Grounded Multimodal Named Entity Recognition

Entity Disambiguation

Global Relationship Understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-grained Query-guided Set Prediction Network (MQSPN)

Query-guided Fusion Net (QFNet)

Grounded Multimodal Named Entity Recognition (GMNER)

🔎 Similar Papers

Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation