🤖 AI Summary
Multimodal Entity Linking (MEL) faces challenges including insufficient contextual modeling, coarse-grained cross-modal fusion, and difficulties in leveraging large language models (LLMs) and large vision models (LVMs) synergistically. To address these, this paper proposes a role-specialized multi-agent collaborative framework that uniformly formulates MEL as a structured cloze-filling task. It introduces a text–vision dual-modality alignment pathway and an adaptive iterative disambiguation strategy; integrates fine-grained semantic descriptions generated by LLMs with image-structural representations extracted by LVMs; and incorporates a tool-augmented retrieval–reasoning joint optimization mechanism for dynamic candidate entity refinement. The method achieves state-of-the-art performance on five public benchmarks, improving accuracy by 1%–57%. Ablation studies validate the effectiveness of multi-agent collaboration, dual-modality alignment, and structured prompting.
📝 Abstract
Multimodal Entity Linking (MEL) aims to associate textual and visual mentions with entities in a multimodal knowledge graph. Despite its importance, current methods face challenges such as incomplete contextual information, coarse-grained cross-modal fusion, and the difficulty of jointly leveraging large language models (LLMs) and large vision models (LVMs). To address these issues, we propose DeepMEL, a novel framework based on multi-agent collaborative reasoning, which achieves efficient alignment and disambiguation of textual and visual modalities through a role-specialized division-of-labor strategy. DeepMEL integrates four specialized agents, namely Modal-Fuser, Candidate-Adapter, Entity-Clozer, and Role-Orchestrator, which accomplish end-to-end cross-modal linking through specialized roles and dynamic coordination. DeepMEL adopts a dual-modality alignment path that combines the fine-grained text semantics generated by the LLM with the structured image representations extracted by the LVM, significantly narrowing the modality gap. We design an adaptive iteration strategy that combines tool-based retrieval with semantic reasoning to dynamically optimize the candidate set and balance recall and precision. DeepMEL also unifies MEL tasks into a structured cloze prompt, reducing parsing complexity and enhancing semantic comprehension. Extensive experiments on five public benchmark datasets demonstrate that DeepMEL achieves state-of-the-art performance, improving accuracy (ACC) by 1%–57%. Ablation studies verify the effectiveness of all modules.
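To make the "structured cloze prompt" formulation concrete, the sketch below shows one plausible way a MEL instance could be rendered as a fill-in-the-blank prompt over a candidate entity list. The function name, field layout, and example data are illustrative assumptions for this summary, not the paper's actual implementation.

```python
# Hypothetical sketch of a structured cloze formulation for MEL: the mention
# is slotted into a fill-in-the-blank template alongside textual context,
# an image description, and a lettered candidate-entity list.
# All names here are illustrative assumptions, not DeepMEL's actual API.

def build_cloze_prompt(mention, context, image_caption, candidates):
    """Format a MEL instance as a structured cloze prompt."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(candidates))
    return (
        f"Context: {context}\n"
        f"Image description: {image_caption}\n"
        f'The mention "{mention}" refers to [MASK].\n'
        f"Choose [MASK] from the candidates:\n{options}\n"
        "Answer with the letter of the correct entity."
    )

prompt = build_cloze_prompt(
    mention="Jordan",
    context="Jordan scored 45 points in the playoff game.",
    image_caption="A basketball player in a red Chicago Bulls jersey.",
    candidates=[
        "Michael Jordan (basketball player)",
        "Jordan (country)",
        "Jordan Peele (filmmaker)",
    ],
)
print(prompt)
```

Framing the task this way means the model's answer space is a single letter rather than free-form text, which is one way a cloze formulation can reduce parsing complexity for downstream evaluation.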