🤖 AI Summary
To address key bottlenecks in Grounded Multimodal Named Entity Recognition (GMNER), namely entity ambiguity, coarse-grained cross-modal alignment, and insufficient global relational modeling, this paper proposes a multi-grained query-guided set prediction framework (MQSPN). Methodologically: (1) GMNER is reformulated as an end-to-end set prediction task, avoiding both one-by-one sequential decoding and human-designed type queries; (2) relationships are modeled at two levels, intra-entity (aligning each textual entity with its visual region) and inter-entity (matching the predicted entities to the gold entities globally); (3) a set of learnable multi-grained queries and a Query-guided Fusion Net (QFNet) are used to jointly align textual spans, entity types, and visual regions. The paper reports state-of-the-art performance on widely used GMNER benchmarks and targets cases of ambiguity such as distinguishing “Jordan” as a person from “Jordan” as part of a shoe brand name.
📝 Abstract
Grounded Multimodal Named Entity Recognition (GMNER) is an emerging information extraction (IE) task that aims to simultaneously extract entity spans, entity types, and the corresponding visual regions from given sentence-image pairs. Recent unified methods based on machine reading comprehension or sequence generation show limitations on this difficult task. The former relies on human-designed type queries and struggles to differentiate ambiguous entities, such as Jordan (Person) and off-White x Jordan (Shoes). The latter decodes entities one by one in a fixed order and therefore suffers from exposure bias. We argue that both lines of work model the relationships among multimodal entities inadequately. To tackle these issues, we propose a novel unified framework named Multi-grained Query-guided Set Prediction Network (MQSPN) to learn appropriate relationships at the intra-entity and inter-entity levels. Specifically, MQSPN explicitly aligns textual entities with visual regions by employing a set of learnable queries to strengthen intra-entity connections. Building on this intra-entity modeling, MQSPN reformulates GMNER as a set prediction problem, guiding the model to establish appropriate inter-entity relationships from an optimal global matching perspective. Additionally, we incorporate a query-guided Fusion Net (QFNet) as a glue network to improve the alignment of the two levels of relationships. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on widely used benchmarks.
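The "optimal global matching" idea behind set prediction can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the tuple layout (span start, span end, entity type, region id), the `pair_cost` function, and the brute-force search are hypothetical stand-ins. Set prediction models commonly solve this assignment with the Hungarian algorithm and a learned cost; the exhaustive search here only shows what is being optimized, on a toy input.

```python
from itertools import permutations

# Hypothetical toy representation of one entity prediction:
# (span_start, span_end, entity_type, region_id)

def pair_cost(pred, gold):
    """Illustrative cost: number of mismatched fields (not the paper's cost)."""
    return sum(p != g for p, g in zip(pred, gold))

def best_matching(preds, golds):
    """Find the one-to-one assignment of gold entities to predictions with the
    lowest total cost, by exhaustive search.

    Returns (assign, cost), where assign[k] is the index of the prediction
    matched to golds[k]. Predictions left unmatched would be treated as
    "no entity". Assumes len(preds) >= len(golds); feasible only for tiny sets.
    """
    best_cost, best_assign = float("inf"), None
    for assign in permutations(range(len(preds)), len(golds)):
        cost = sum(pair_cost(preds[i], g) for i, g in zip(assign, golds))
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_assign, best_cost

# Three query outputs, emitted in no particular order.
preds = [
    (3, 5, "Shoes", 1),
    (0, 0, "Person", 0),
    (7, 8, "Location", 2),
]
# Two gold entities.
golds = [
    (0, 0, "Person", 0),
    (3, 5, "Shoes", 1),
]

assign, cost = best_matching(preds, golds)
print(assign, cost)  # (1, 0) 0
```

Because the loss is computed after matching, the order in which queries emit entities does not matter, which is the property that lets a set prediction formulation avoid a fixed one-by-one decoding order.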