🤖 AI Summary
To address two weaknesses of existing multimodal entity linking methods—redundant incorporation of images and the limitations of one-time visual feature extraction—this paper proposes a text-prioritized, vision-text co-reflective large language model (LLM) framework. The method relies solely on textual cues when they provide sufficient disambiguating information; otherwise, it dynamically activates salient visual cues, enabling iterative reasoning via intra-modal self-reflection and cross-modal alignment for fine-grained multimodal fusion. Crucially, the framework's co-reflective mechanism jointly improves efficiency and robustness, suppressing interference from irrelevant images and overcoming the constraints of static visual representations. Evaluated on three standard benchmarks, the approach significantly outperforms state-of-the-art methods, with an average accuracy gain of +3.3% (+3.2%, +5.1%, and +1.6% on the three datasets, respectively).
📝 Abstract
Multimodal entity linking plays a crucial role in a wide range of applications. Recent advances in large language model-based methods have made them the dominant paradigm for this task, effectively leveraging both textual and visual modalities to enhance performance. Despite their success, these methods still face two challenges: the unnecessary incorporation of image data in certain scenarios, and the reliance on one-time extraction of visual features, both of which can undermine their effectiveness and accuracy. To address these challenges, we propose a novel LLM-based framework for the multimodal entity linking task, called Intra- and Inter-modal Collaborative Reflections (I2CR). This framework prioritizes textual information for solving the task. When text alone is insufficient to link the correct entity, as judged by intra- and inter-modality evaluations, it employs a multi-round iterative strategy that integrates key visual clues from various aspects of the image to support reasoning and enhance matching accuracy. Extensive experiments on three widely used public datasets demonstrate that our framework consistently outperforms current state-of-the-art methods, achieving improvements of 3.2%, 5.1%, and 1.6%, respectively. Our code is available at https://github.com/ziyan-xiaoyu/I2CR/.
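The control flow described in the abstract—try text first, and only fall back to multi-round visual clue integration when textual evidence is insufficient—can be sketched as follows. This is a minimal illustrative stand-in, not the paper's implementation: the function names, the toy token-overlap confidence score, and the fixed list of image "aspects" are all assumptions made for the sketch.

```python
# Illustrative sketch of a text-prioritized, iterative multimodal entity
# linker. All scoring logic and helper functions here are hypothetical
# stand-ins for the LLM-based evaluations described in the abstract.

def link_with_text(context: str, candidates: list[str]) -> tuple[str, float]:
    """Stub text-only linker: returns the best candidate and a confidence.

    Toy confidence: fraction of context tokens that appear in the candidate.
    In the actual framework this role is played by LLM-based intra- and
    inter-modality evaluations.
    """
    tokens = context.lower().split()

    def score(cand: str) -> float:
        return sum(t in cand.lower() for t in tokens) / max(len(tokens), 1)

    best = max(candidates, key=score)
    return best, score(best)


def extract_visual_clue(image: str, round_idx: int) -> str:
    """Stub: surfaces a different aspect of the image on each round."""
    aspects = ["salient objects", "scene text", "background context"]
    return f"{aspects[round_idx % len(aspects)]} of {image}"


def link_entity(mention: str, candidates: list[str], image: str,
                threshold: float = 0.8, max_rounds: int = 3) -> str:
    # 1) Text-prioritized pass: attempt linking from the mention alone.
    entity, conf = link_with_text(mention, candidates)
    if conf >= threshold:  # text suffices -> image is never consulted
        return entity
    # 2) Otherwise, iterate: fold in one visual clue per round and re-link,
    #    stopping as soon as the match is confident enough.
    context = mention
    for r in range(max_rounds):
        context = f"{context} [{extract_visual_clue(image, r)}]"
        entity, conf = link_with_text(context, candidates)
        if conf >= threshold:
            break
    return entity
```

The key design point mirrored here is that the image-processing branch is skipped entirely whenever the text-only pass is already confident, which is how the framework avoids the cost and noise of unnecessary visual input.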