I2CR: Intra- and Inter-modal Collaborative Reflections for Multimodal Entity Linking

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address performance bottlenecks in multimodal entity linking, specifically image redundancy and the limitations of single-pass visual feature extraction, this paper proposes a text-prioritized, vision-text co-reflective large language model (LLM) framework. The method relies solely on textual cues when they carry sufficient disambiguating information; otherwise, it dynamically activates salient visual cues, enabling iterative reasoning via intra-modal self-reflection and cross-modal alignment for fine-grained multimodal fusion. Crucially, the paper introduces a co-reflective mechanism that jointly optimizes efficiency and robustness, suppressing irrelevant image interference and overcoming the constraints of static visual representations. Evaluated on three standard benchmarks, the approach achieves significant improvements over state-of-the-art methods, with accuracy gains of 3.2%, 5.1%, and 1.6% (an average of 3.3%).

📝 Abstract
Multimodal entity linking plays a crucial role in a wide range of applications. Recent advances have made large language model-based methods the dominant paradigm for this task, effectively leveraging both textual and visual modalities to enhance performance. Despite their success, these methods still face two challenges: unnecessary incorporation of image data in certain scenarios, and reliance on a one-time extraction of visual features, both of which can undermine their effectiveness and accuracy. To address these challenges, we propose a novel LLM-based framework for the multimodal entity linking task, called Intra- and Inter-modal Collaborative Reflections. The framework prioritizes textual information when addressing the task. When text alone is insufficient to link the correct entity, as judged by intra- and inter-modality evaluations, it employs a multi-round iterative strategy that integrates key visual clues from various aspects of the image to support reasoning and improve matching accuracy. Extensive experiments on three widely used public datasets demonstrate that our framework consistently outperforms current state-of-the-art methods, achieving improvements of 3.2%, 5.1%, and 1.6%, respectively. Our code is available at https://github.com/ziyan-xiaoyu/I2CR/.
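The text-first, reflect-then-fuse loop that the abstract describes can be illustrated with a minimal toy skeleton. Everything below is a sketch under stated assumptions: the scoring and sufficiency functions are simple lexical stand-ins for the paper's LLM-based intra- and inter-modal evaluations, and none of the names correspond to the released code.

```python
# Toy sketch of the I2CR-style control flow: try text alone first, and only
# fall back to iterative visual-clue fusion when the text is ambiguous.
# All functions here are illustrative stand-ins, not the authors' API.

def text_score(text, candidate):
    """Toy stand-in for an LLM's text-only match judgment: word overlap."""
    return len(set(text.lower().split()) & set(candidate.lower().split()))

def is_sufficient(text, candidate, threshold=2):
    """Toy intra-modal self-check: is the evidence for this entity strong enough?"""
    return text_score(text, candidate) >= threshold

def extract_visual_clue(image_tags, round_idx):
    """Toy stand-in for extracting one salient clue from the image per round."""
    return image_tags[round_idx % len(image_tags)]

def link_entity(text, image_tags, candidates, max_rounds=3):
    """Prefer text-only linking; iteratively fold in visual clues when text is weak."""
    best = max(candidates, key=lambda c: text_score(text, c))
    if is_sufficient(text, best):
        return best  # text alone disambiguates; the image is never consulted
    clues = []
    for r in range(max_rounds):
        clues.append(extract_visual_clue(image_tags, r))
        fused = text + " " + " ".join(clues)  # toy cross-modal fusion
        best = max(candidates, key=lambda c: text_score(fused, c))
        if is_sufficient(fused, best):
            break  # inter-modal check passes; stop iterating early
    return best

# An ambiguous mention: "jaguar" alone cannot separate the two candidate entities,
# so one round of visual clues ("animal") tips the match.
candidates = ["Jaguar car company", "Jaguar animal cat"]
print(link_entity("the jaguar ran fast", ["animal", "cat"], candidates))
```

The early return on the text-only path mirrors the paper's motivation: when text suffices, skipping the image avoids the irrelevant-image interference the authors identify, and the bounded `max_rounds` loop mirrors the multi-round clue integration rather than a single one-time visual extraction.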
Problem

Research questions and friction points this paper is trying to address.

Addresses unnecessary image data use in multimodal entity linking
Improves one-time visual feature extraction limitations in linking
Enhances accuracy by integrating text and iterative visual clues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prioritizes text information for entity linking
Multi-round iterative visual clue integration
Intra- and inter-modal collaborative evaluations
Authors
- Ziyan Liu (East China University of Science and Technology)
- Junwen Li (East China University of Science and Technology)
- Kaiwen Li (South China University of Technology)
- Tong Ruan (East China University of Science and Technology)
- Chao Wang (Shanghai University)
- Xinyan He (Meituan)
- Zongyu Wang (Henkel Corporation, Oak Ridge National Laboratory, Carnegie Mellon University)
- Xuezhi Cao (Meituan)
- Jingping Liu (ECUST)