Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation

📅 2024-06-11
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This paper proposes RiVEG, a unified framework that addresses two key challenges of social media data: weak image-text alignment, which leaves many named entities visually ungroundable, and the fine-grained nature of named entities compared with conventional noun phrases. It introduces the first LLM-driven task reformulation paradigm for Grounded Multimodal Named Entity Recognition (GMNER), decomposing the task into three jointly optimized subtasks: Multimodal Named Entity Recognition (MNER), Visual Entailment (VE), and Visual Grounding (VG). Extending this paradigm, the paper formalizes a new fine-grained segmentation task, Segmented Multimodal Named Entity Recognition (SMNER), and releases the first benchmark dataset for it, Twitter-SMNER. For mask-level entity localization, the framework integrates a box-prompted Segment Anything Model (SAM). RiVEG achieves state-of-the-art results across four benchmarks, with significant improvements on the MNER, GMNER, and SMNER tasks, and is the first work to show that GMNER models transfer effectively to pixel-level segmentation.

📝 Abstract
The Grounded Multimodal Named Entity Recognition (GMNER) task aims to identify named entities, entity types, and their corresponding visual regions. The task exhibits two challenging attributes: 1) the tenuous correlation between images and text on social media leaves a notable proportion of named entities ungroundable; 2) there is a distinction between the coarse-grained noun phrases used in similar tasks (e.g., phrase localization) and fine-grained named entities. In this paper, we propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as connecting bridges. This reformulation brings two benefits: 1) it allows the MNER module to be optimized for the best MNER performance and eliminates the need to pre-extract region features with object detection methods, naturally addressing the two major limitations of existing GMNER methods; 2) the introduction of the Entity Expansion Expression module and the Visual Entailment (VE) module unifies Visual Grounding (VG) and Entity Grounding (EG), endowing the framework with unlimited data and model scalability. Furthermore, to address the potential ambiguity of the coarse-grained bounding-box output in GMNER, we construct the new Segmented Multimodal Named Entity Recognition (SMNER) task and the corresponding Twitter-SMNER dataset, aimed at generating fine-grained segmentation masks, and experimentally demonstrate the feasibility and effectiveness of using a box-prompt-based Segment Anything Model (SAM) to equip any GMNER model with the ability to accomplish the SMNER task. Extensive experiments demonstrate that RiVEG significantly outperforms SoTA methods on four datasets across the MNER, GMNER, and SMNER tasks.
Problem

Research questions and friction points this paper is trying to address.

Identifying named entities and visual regions in multimodal data
Addressing ungroundable entities due to weak image-text correlation
Generating fine-grained segmentation masks instead of coarse bounding boxes
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based reformulation decomposes GMNER into jointly optimized MNER-VE-VG subtasks
Entity Expansion Expression unifies Visual Grounding and Entity Grounding
Box-prompted SAM produces fine-grained segmentation masks from bounding boxes
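The staged pipeline implied by these points can be sketched as follows. This is a minimal illustrative mock, not the authors' implementation: every model call (the MNER lexicon, the LLM expansion templates, the entailment rule, the predicted box, and the box-to-mask step) is a hypothetical stub standing in for a real trained component.

```python
# Hypothetical sketch of a RiVEG-style MNER -> VE -> VG -> SAM pipeline.
# All functions below are illustrative stubs, not the paper's actual models.

from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates


def extract_entities(text: str) -> List[Tuple[str, str]]:
    """MNER stub: return (entity, type) pairs found in the text.
    A real system would run a fine-tuned multimodal NER model."""
    known = {"Messi": "PER", "Barcelona": "ORG"}  # toy lexicon (assumption)
    return [(tok, known[tok]) for tok in text.split() if tok in known]


def expand_entity(entity: str, entity_type: str) -> str:
    """Entity Expansion Expression stub: an LLM would rewrite the
    fine-grained entity as a coarse referring expression for grounding."""
    type_to_phrase = {"PER": "the man", "ORG": "the logo"}  # placeholder templates
    return f"{type_to_phrase.get(entity_type, 'the object')} ({entity})"


def visual_entailment(image_id: str, expression: str) -> bool:
    """VE stub: decide whether the expression is groundable in the image.
    Real systems score image-text entailment; here, a toy rule."""
    return "man" in expression


def visual_grounding(image_id: str, expression: str) -> Box:
    """VG stub: predict one bounding box for a groundable expression."""
    return (40, 20, 200, 300)  # fixed placeholder box


def box_to_mask(image_id: str, box: Box) -> List[Box]:
    """SAM stub: a box-prompted segmenter would return a pixel mask;
    here the box itself stands in as a degenerate 'mask'."""
    return [box]


def riveg_pipeline(image_id: str, text: str):
    """Run the full chain; ungroundable entities get (None, None)."""
    results = []
    for entity, etype in extract_entities(text):
        expression = expand_entity(entity, etype)
        if not visual_entailment(image_id, expression):
            results.append((entity, etype, None, None))  # ungroundable entity
            continue
        box = visual_grounding(image_id, expression)
        mask = box_to_mask(image_id, box)
        results.append((entity, etype, box, mask))
    return results
```

The VE stage is what lets the pipeline handle the weak image-text alignment of social media: an entity rejected by entailment is reported as ungroundable rather than forced onto a spurious region, and only entailed entities proceed to box prediction and mask generation.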