🤖 AI Summary
This work addresses multimodal named entity recognition and visual grounding on open-world social media, where long-tailed distributions, rapid concept evolution, and unseen entities hinder performance. The authors propose SAKE, a framework whose self-aware reasoning mechanism balances the use of internal knowledge against the exploration of external knowledge within multimodal large language models. SAKE quantifies the model's entity-level uncertainty to produce a knowledge-gap signal, which lets the model decide autonomously when to invoke a search tool. Training proceeds in two stages: first, difficulty-aware search tags are generated and used to construct a chain-of-thought dataset for supervised fine-tuning; second, agentic reinforcement learning is applied with a hybrid reward that penalizes unnecessary retrieval. Experiments demonstrate that SAKE significantly outperforms existing methods on two mainstream social media benchmarks, achieving a strong balance between accuracy on known entities and generalization to unseen ones.
📝 Abstract
Grounded Multimodal Named Entity Recognition (GMNER) aims to extract named entities and localize their visual regions within image-text pairs, serving as a pivotal capability for various downstream applications. On open-world social media platforms, GMNER remains challenging due to the prevalence of long-tailed, rapidly evolving, and unseen entities. Existing approaches typically rely on either external knowledge exploration through heuristic retrieval or internal knowledge exploitation via iterative refinement in Multimodal Large Language Models (MLLMs). However, heuristic retrieval often introduces noisy or conflicting evidence that degrades precision on known entities, while purely internal exploitation is constrained by the knowledge boundaries of MLLMs and is prone to hallucinations. To address these limitations, we propose SAKE, an end-to-end agentic framework that harmonizes internal knowledge exploitation and external knowledge exploration via self-aware reasoning and adaptive search tool invocation. We implement this via a two-stage training paradigm. First, we propose Difficulty-aware Search Tag Generation, which quantifies the model's entity-level uncertainty through multiple forward samplings to produce explicit knowledge-gap signals. Based on these signals, we construct SAKE-SeCoT, a high-quality Chain-of-Thought dataset that equips the model with basic self-awareness and tool-use capabilities through supervised fine-tuning. Second, we employ agentic reinforcement learning with a hybrid reward function that penalizes unnecessary retrieval, enabling the model to evolve from rigid search imitation to genuine self-aware decision-making about when retrieval is truly necessary. Extensive experiments on two widely used social media benchmarks demonstrate SAKE's effectiveness.
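The two training signals described in the abstract can be illustrated with a minimal sketch. This is not SAKE's implementation: the abstract does not specify how uncertainty is computed or how the reward is composed, so the agreement-based uncertainty measure, the 0.6 threshold, the penalty value, and every function name below are assumptions chosen for illustration. Each sampled model output is represented as a set of (entity span, entity type) pairs.

```python
def entity_agreement(samples, entity):
    """Fraction of sampled outputs that contain this (span, type) pair."""
    if not samples:
        return 0.0
    hits = sum(1 for s in samples if entity in s)
    return hits / len(samples)


def search_tags(samples, threshold=0.6):
    """Map each candidate entity to True when a search tag is warranted.

    Assumption: low agreement across repeated samplings is treated as a
    knowledge-gap signal. The threshold is a hypothetical hyperparameter.
    """
    candidates = sorted(set().union(*samples)) if samples else []
    return {e: entity_agreement(samples, e) < threshold for e in candidates}


def hybrid_reward(task_score, used_search, needed_search, penalty=0.25):
    """Task score minus a penalty when retrieval was invoked needlessly.

    Assumption: a single additive penalty; the actual reward terms used by
    the authors are not given in the abstract.
    """
    unnecessary = used_search and not needed_search
    return task_score - (penalty if unnecessary else 0.0)


# Four hypothetical samplings for one image-text pair.
samples = [
    {("Taylor Swift", "PER")},
    {("Taylor Swift", "PER"), ("Eras", "MISC")},
    {("Taylor Swift", "PER")},
    {("Taylor Swift", "PER"), ("Eras", "ORG")},
]
tags = search_tags(samples)
```

In this toy input the model predicts the person consistently, so no search tag is produced for it, while the two conflicting readings of "Eras" each appear in only one of four samplings and are tagged for retrieval. The reward sketch shows only the shape of the idea: a rollout that searches for an entity the model already handles reliably scores lower than one that answers directly.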