DeepMEL: A Multi-Agent Collaboration Framework for Multimodal Entity Linking

📅 2025-08-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal Entity Linking (MEL) faces challenges including insufficient contextual modeling, coarse-grained cross-modal fusion, and difficulties in leveraging large language models (LLMs) and large vision models (LVMs) synergistically. To address these, this paper proposes a role-specialized multi-agent collaborative framework that uniformly formulates MEL as a structured cloze-filling task. It introduces a text–vision dual-modality alignment pathway and an adaptive iterative disambiguation strategy; integrates fine-grained semantic descriptions generated by LLMs with image-structural representations extracted by LVMs; and incorporates a tool-augmented retrieval–reasoning joint optimization mechanism for dynamic candidate entity refinement. The method achieves state-of-the-art performance on five public benchmarks, improving accuracy by 1%–57%. Ablation studies validate the effectiveness of multi-agent collaboration, dual-modality alignment, and structured prompting.

📝 Abstract
Multimodal Entity Linking (MEL) aims to associate textual and visual mentions with entities in a multimodal knowledge graph. Despite its importance, current methods face challenges such as incomplete contextual information, coarse cross-modal fusion, and the difficulty of jointly leveraging large language models (LLMs) and large visual models (LVMs). To address these issues, we propose DeepMEL, a novel framework based on multi-agent collaborative reasoning, which achieves efficient alignment and disambiguation of textual and visual modalities through a role-specialized division of labor. DeepMEL integrates four specialized agents, namely Modal-Fuser, Candidate-Adapter, Entity-Clozer, and Role-Orchestrator, to complete end-to-end cross-modal linking through specialized roles and dynamic coordination. DeepMEL adopts a dual-modal alignment path that combines the fine-grained text semantics generated by the LLM with the structured image representation extracted by the LVM, significantly narrowing the modal gap. We design an adaptive iteration strategy that combines tool-based retrieval and semantic reasoning to dynamically optimize the candidate set and balance recall and precision. DeepMEL also unifies MEL tasks into a structured cloze prompt to reduce parsing complexity and enhance semantic comprehension. Extensive experiments on five public benchmark datasets demonstrate that DeepMEL achieves state-of-the-art performance, improving ACC by 1%-57%. Ablation studies verify the effectiveness of all modules.
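The abstract's idea of unifying MEL as a structured cloze prompt can be illustrated with a minimal sketch. The template, function name, and example data below are illustrative assumptions, not the paper's exact prompt format: the model is asked to fill a [MASK] slot with one of the retrieved candidate entities, given the textual context and an LVM-generated image description.

```python
# Hypothetical sketch: casting entity linking as cloze filling.
# The prompt template and field names are assumptions for illustration,
# not DeepMEL's actual structured prompt.

def build_cloze_prompt(mention: str, context: str,
                       image_caption: str, candidates: list[str]) -> str:
    """Render a cloze-style prompt in which the model fills [MASK]
    with the correct candidate entity for the given mention."""
    options = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return (
        f"Context: {context}\n"
        f"Image description: {image_caption}\n"
        f"The mention '{mention}' refers to the entity [MASK].\n"
        f"Choose [MASK] from the candidates:\n{options}\n"
        f"Answer with the candidate number only."
    )

# Toy usage with invented mention/candidates:
prompt = build_cloze_prompt(
    mention="Jordan",
    context="Jordan scored 45 points in the 1998 finals.",
    image_caption="A basketball player in a red Chicago Bulls jersey.",
    candidates=[
        "Michael Jordan (basketball player)",
        "Jordan (country)",
        "Jordan Peele (director)",
    ],
)
print(prompt)
```

A single structured template like this reduces output-parsing complexity: the linker only needs to read back a candidate index, rather than free-form text.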
Problem

Research questions and friction points this paper is trying to address.

Addresses incomplete contextual information in multimodal entity linking
Resolves coarse cross-modal fusion between text and visual data
Integrates large language models and large visual models effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent collaborative reasoning framework
Dual-modal alignment combining LLM and LVM
Structured cloze prompt unification technique
Fang Wang
Postdoc, Stanford University
Reading acquisition, dyslexia, cross-linguistic research, bilingualism, cognitive neuroscience
Tianwei Yan
School of Information Science and Engineering, Chongqing Jiaotong University, No.66 Xuefu Road, Chongqing, 400074, China
Zonghao Yang
China Research and Development Academy of Machinery Equipment, No. 10 Courtyard Road, Beijing, 100089, China
Minghao Hu
Center of Information Research, Academy of Military Science, No.26 Fucheng Road, Beijing, 100142, China
Jun Zhang
Center of Information Research, Academy of Military Science, No.26 Fucheng Road, Beijing, 100142, China
Zhunchen Luo
Unknown affiliation
Xiaoying Bai
Tsinghua University
Software engineering, software testing, service-oriented computing, cloud computing