E2E-GMNER: End-to-End Generative Grounded Multimodal Named Entity Recognition

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of pipeline approaches to grounded multimodal named entity recognition (GMNER), which decouple textual entity recognition from visual grounding and therefore suffer from error propagation and suboptimal joint optimization. The authors propose E2E-GMNER, an end-to-end generative framework built on a single multimodal large language model. Through instruction tuning, it unifies entity recognition, semantic typing, visual grounding, and implicit knowledge reasoning. Two components support this: chain-of-thought reasoning, which lets the model decide when visual evidence or background knowledge is informative, and Gaussian Risk-Aware Box Perturbation (GRBP), which replaces hard box supervision with probabilistically perturbed soft targets to improve robustness against annotation noise and discretization errors. Experiments on the Twitter-GMNER and Twitter-FMNERG benchmarks show performance that is highly competitive with state-of-the-art methods, supporting the value of unified end-to-end optimization and noise-aware grounding supervision.
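
To make the generative formulation concrete, the sketch below shows one way a set of grounded entities could be serialized into a single target string for a language model, with box coordinates quantized to integer bins. It is only an illustration: the `mention | type | box` template, the `GroundedEntity` class, and the 1000-bin grid are assumptions introduced here, not the format used by E2E-GMNER. The quantization step also shows where the discretization error mentioned above comes from.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

NUM_BINS = 1000  # assumed size of the coordinate grid (not from the paper)


@dataclass
class GroundedEntity:
    """Hypothetical container for one GMNER prediction."""
    mention: str
    entity_type: str
    # Normalized (x1, y1, x2, y2) in [0, 1], or None if the entity
    # has no corresponding region in the image.
    box: Optional[Tuple[float, float, float, float]]


def quantize(value: float, num_bins: int = NUM_BINS) -> int:
    """Map a normalized coordinate to an integer bin index."""
    value = min(max(value, 0.0), 1.0)
    return min(int(value * num_bins), num_bins - 1)


def dequantize(bin_index: int, num_bins: int = NUM_BINS) -> float:
    """Map a bin index back to the center of its bin."""
    return (bin_index + 0.5) / num_bins


def serialize(entities: List[GroundedEntity]) -> str:
    """Render entities as one text target (assumed template)."""
    parts = []
    for entity in entities:
        if entity.box is None:
            location = "none"
        else:
            location = ",".join(str(quantize(v)) for v in entity.box)
        parts.append(f"{entity.mention} | {entity.entity_type} | {location}")
    return " ; ".join(parts)


if __name__ == "__main__":
    example = [
        GroundedEntity("Taylor Swift", "PER", (0.125, 0.25, 0.5, 0.75)),
        GroundedEntity("Nashville", "LOC", None),
    ]
    print(serialize(example))
```

Under these assumptions, a coordinate that round-trips through `quantize` and `dequantize` can shift by up to half a bin width, which is one source of the discretization error that box supervision has to tolerate.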

📝 Abstract

Grounded Multimodal Named Entity Recognition (GMNER) aims to jointly identify named entity mentions in text, predict their semantic types, and ground each entity to a corresponding visual region in an associated image. Existing approaches predominantly adopt pipeline-based architectures that decouple textual entity recognition and visual grounding, leading to error accumulation and suboptimal joint optimization. In this paper, we propose E2E-GMNER, a fully end-to-end generative framework that unifies entity recognition, semantic typing, visual grounding, and implicit knowledge reasoning within a single multimodal large language model. We formulate GMNER as an instruction-tuned conditional generation task and incorporate chain-of-thought reasoning to enable the model to adaptively determine when visual evidence or background knowledge is informative, reducing reliance on noisy cues. To further address the instability of generative bounding box prediction, we introduce Gaussian Risk-Aware Box Perturbation (GRBP), which replaces hard box supervision with probabilistically perturbed soft targets to improve robustness against annotation noise and discretization errors. Extensive experiments on the Twitter-GMNER and Twitter-FMNERG benchmarks demonstrate that E2E-GMNER achieves highly competitive performance compared with state-of-the-art methods, validating the effectiveness of unified end-to-end optimization and noise-aware grounding supervision. Code is available at: https://github.com/Finch-coder/E2E-GMNER
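
The abstract describes GRBP only at a high level, so the following is a rough sketch of the general idea of Gaussian box perturbation, not the authors' implementation. It jitters each corner of a normalized box with Gaussian noise scaled to the box size, and keeps a candidate only if it still overlaps the annotated box sufficiently. The names `perturb_box`, `sigma_scale`, `min_iou`, and `max_tries`, and the IoU acceptance check used here as a stand-in for "risk awareness", are all assumptions made for illustration; the released code is the reference for the actual rule.

```python
import random
from typing import Tuple

Box = Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def clip01(value: float) -> float:
    """Clamp a coordinate to the normalized image range."""
    return min(max(value, 0.0), 1.0)


def perturb_box(
    box: Box,
    rng: random.Random,
    sigma_scale: float = 0.05,  # assumed noise level, relative to box size
    min_iou: float = 0.7,       # assumed acceptance threshold
    max_tries: int = 10,
) -> Box:
    """Return a Gaussian-jittered copy of `box`, or `box` itself if no
    candidate stays close enough to the annotation (assumed rule)."""
    width, height = box[2] - box[0], box[3] - box[1]
    for _ in range(max_tries):
        x1 = clip01(box[0] + rng.gauss(0.0, sigma_scale * width))
        y1 = clip01(box[1] + rng.gauss(0.0, sigma_scale * height))
        x2 = clip01(box[2] + rng.gauss(0.0, sigma_scale * width))
        y2 = clip01(box[3] + rng.gauss(0.0, sigma_scale * height))
        # Reorder corners in case the noise flipped them.
        candidate = (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
        if iou(box, candidate) >= min_iou:
            return candidate
    return box  # fall back to the hard target


if __name__ == "__main__":
    rng = random.Random(0)
    annotated = (0.2, 0.3, 0.6, 0.8)
    for _ in range(3):
        print(perturb_box(annotated, rng))
```

Sampling a fresh perturbed box each time a training example is seen means the model is supervised on a distribution around the annotation rather than on a single hard coordinate string, which is the intuition behind soft targets in this sketch.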

Problem

Research questions and friction points this paper is trying to address.

Grounded Multimodal Named Entity Recognition

End-to-End Learning

Visual Grounding

Multimodal Entity Recognition

Noise-Robust Grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-End Generative Framework

Multimodal Named Entity Recognition

Chain-of-Thought Reasoning