🤖 AI Summary
To address two weaknesses of existing multimodal entity linking methods—redundant incorporation of images and the limitations of one-time visual feature extraction—this paper proposes a text-prioritized, vision-text co-reflective large language model (LLM) framework. The method relies solely on textual cues when they provide sufficient disambiguating information; otherwise, it dynamically activates salient visual cues, enabling iterative reasoning via intra-modal self-reflection and cross-modal alignment for fine-grained multimodal fusion. Crucially, the framework's co-reflective mechanism jointly improves efficiency and robustness, suppressing interference from irrelevant images and overcoming the constraints of static visual representations. Evaluated on three standard benchmarks, the approach significantly outperforms state-of-the-art methods, with an average accuracy gain of +3.3% (+3.2%, +5.1%, and +1.6% on the three datasets, respectively).
📝 Abstract
Multimodal entity linking plays a crucial role in a wide range of applications. Recent advances in large language model-based methods have made them the dominant paradigm for this task, effectively leveraging both textual and visual modalities to enhance performance. Despite their success, these methods still face two challenges: the unnecessary incorporation of image data in certain scenarios, and the reliance on one-time extraction of visual features, both of which can undermine their effectiveness and accuracy. To address these challenges, we propose a novel LLM-based framework for the multimodal entity linking task, called Intra- and Inter-modal Collaborative Reflections (I2CR). This framework prioritizes textual information for solving the task. When text alone is insufficient to link the correct entity, as judged by intra- and inter-modality evaluations, it employs a multi-round iterative strategy that integrates key visual clues from various aspects of the image to support reasoning and enhance matching accuracy. Extensive experiments on three widely used public datasets demonstrate that our framework consistently outperforms current state-of-the-art methods, achieving improvements of 3.2%, 5.1%, and 1.6%, respectively. Our code is available at https://github.com/ziyan-xiaoyu/I2CR/.
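The control flow described in the abstract—try text first, and only fall back to multi-round visual clue integration when textual evidence is insufficient—can be sketched as follows. This is a minimal illustrative stand-in, not the paper's implementation: the function names, the toy token-overlap confidence score, and the fixed list of image "aspects" are all assumptions made for the sketch.

```python
# Illustrative sketch of a text-prioritized, iterative multimodal entity
# linker. All scoring logic and helper functions here are hypothetical
# stand-ins for the LLM-based evaluations described in the abstract.

def link_with_text(context: str, candidates: list[str]) -> tuple[str, float]:
    """Stub text-only linker: returns the best candidate and a confidence.

    Toy confidence: fraction of context tokens that appear in the candidate.
    In the actual framework this role is played by LLM-based intra- and
    inter-modality evaluations.
    """
    tokens = context.lower().split()

    def score(cand: str) -> float:
        return sum(t in cand.lower() for t in tokens) / max(len(tokens), 1)

    best = max(candidates, key=score)
    return best, score(best)


def extract_visual_clue(image: str, round_idx: int) -> str:
    """Stub: surfaces a different aspect of the image on each round."""
    aspects = ["salient objects", "scene text", "background context"]
    return f"{aspects[round_idx % len(aspects)]} of {image}"


def link_entity(mention: str, candidates: list[str], image: str,
                threshold: float = 0.8, max_rounds: int = 3) -> str:
    # 1) Text-prioritized pass: attempt linking from the mention alone.
    entity, conf = link_with_text(mention, candidates)
    if conf >= threshold:  # text suffices -> image is never consulted
        return entity
    # 2) Otherwise, iterate: fold in one visual clue per round and re-link,
    #    stopping as soon as the match is confident enough.
    context = mention
    for r in range(max_rounds):
        context = f"{context} [{extract_visual_clue(image, r)}]"
        entity, conf = link_with_text(context, candidates)
        if conf >= threshold:
            break
    return entity
```

The key design point mirrored here is that the image-processing branch is skipped entirely whenever the text-only pass is already confident, which is how the framework avoids the cost and noise of unnecessary visual input.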