ELMM: Efficient Lightweight Multimodal Large Language Models for Multimodal Knowledge Graph Completion

📅 2025-10-19
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address challenges in multimodal knowledge graph completion (MKGC), namely semantic noise in image tokens, cross-modal conflicts, and the high computational overhead of deploying multimodal large language models (MLLMs), this paper proposes ELMM, a lightweight MLLM framework. Its core innovation is the Multi-view Visual Token Compressor (MVTC), which adaptively compresses visual tokens from both textual and visual views, reducing redundancy while preserving semantic consistency. ELMM additionally prunes redundant attention layers from the MLLM and introduces a linear projection to recover the performance lost to pruning, drastically reducing inference cost. Evaluated on the FB15k-237-IMG and WN18-IMG benchmarks, ELMM achieves state-of-the-art performance with fewer parameters and lower latency. It marks the first efficient and robust deployment of MLLMs for MKGC, establishing a scalable new paradigm for multimodal knowledge graph completion.

๐Ÿ“ Abstract
Multimodal Knowledge Graphs (MKGs) extend traditional knowledge graphs by incorporating visual and textual modalities, enabling richer and more expressive entity representations. However, existing MKGs often suffer from incompleteness, which hinders their effectiveness in downstream tasks. The multimodal knowledge graph completion (MKGC) task is therefore receiving increasing attention. While large language models (LLMs) have shown promise for knowledge graph completion (KGC), their application to the multimodal setting remains underexplored. Moreover, applying Multimodal Large Language Models (MLLMs) to MKGC introduces significant challenges: (1) the large number of image tokens per entity leads to semantic noise and modality conflicts, and (2) processing large token inputs incurs high computational cost. To address these issues, we propose Efficient Lightweight Multimodal Large Language Models (ELMM) for MKGC. ELMM proposes a Multi-view Visual Token Compressor (MVTC) based on a multi-head attention mechanism, which adaptively compresses image tokens from both textual and visual views, thereby effectively reducing redundancy while retaining necessary information and avoiding modality conflicts. Additionally, we design an attention pruning strategy that removes redundant attention layers from MLLMs, significantly reducing inference cost. We further introduce a linear projection to compensate for the performance degradation caused by pruning. Extensive experiments on the FB15k-237-IMG and WN18-IMG benchmarks demonstrate that ELMM achieves state-of-the-art performance while substantially improving computational efficiency, establishing a new paradigm for multimodal knowledge graph completion.
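The abstract describes MVTC only at a high level. As an illustration of the general idea behind query-based token compression (a small set of queries cross-attends over the many image tokens, each query emitting one compressed token), here is a minimal single-head sketch in NumPy. All names, shapes, and the use of random queries are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_tokens(image_tokens, queries):
    """Cross-attention pooling: each query attends over all image tokens
    and emits one compressed token (single head for brevity).
    image_tokens: (n_img, d); queries: (n_out, d) with n_out << n_img."""
    d = image_tokens.shape[-1]
    scores = queries @ image_tokens.T / np.sqrt(d)   # (n_out, n_img)
    weights = softmax(scores, axis=-1)               # attention over tokens
    return weights @ image_tokens                    # (n_out, d)

rng = np.random.default_rng(0)
img = rng.normal(size=(256, 64))    # e.g. 256 visual tokens from an encoder
q_text = rng.normal(size=(8, 64))   # hypothetical queries for the textual view
q_vis = rng.normal(size=(8, 64))    # hypothetical queries for the visual view

# Compress from both views and concatenate: 256 tokens -> 16 tokens.
compressed = np.concatenate([compress_tokens(img, q_text),
                             compress_tokens(img, q_vis)])
print(compressed.shape)
```

In the paper's setting the two query sets would be derived from the textual and visual context rather than sampled at random, and multi-head attention would replace the single head used here for brevity.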
Problem

Research questions and friction points this paper is trying to address.

Addressing multimodal knowledge graph incompleteness using lightweight large language models
Reducing semantic noise and modality conflicts from excessive image tokens
Minimizing computational costs while maintaining MKGC performance through efficient compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view token compressor reduces redundancy and conflicts
Attention pruning strategy cuts computational costs significantly
Linear projection compensates performance loss from pruning
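The last two bullets can be illustrated with a generic recipe: after an attention sublayer is pruned, a cheap linear projection is fitted on calibration activations so that it approximates the removed sublayer's contribution. The sketch below uses a synthetic stand-in for the pruned layer's output and closed-form least squares; all of it is an illustrative assumption, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32
X = rng.normal(size=(500, d))  # calibration hidden states entering the pruned layer

# Synthetic stand-in for what the pruned attention sublayer would have output
# on the calibration data (unknown true map plus a little noise).
W_true = 0.1 * rng.normal(size=(d, d))
Y = X @ W_true + 0.01 * rng.normal(size=(500, d))

# Fit a linear projection W so that X @ W approximates the pruned sublayer's
# output; at inference the cheap matmul replaces the full attention layer.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
rel_err = np.linalg.norm(X @ W - Y) / np.linalg.norm(Y)
print(round(rel_err, 3))
```

The closed-form fit keeps the compensation step training-free on the calibration set; a learned projection trained end-to-end, as a lightweight module inside the model, would serve the same role.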
Wei Huang
School of Computer Science, Beijing University of Posts and Telecommunications
Peining Li
School of Computer Science, Beijing University of Posts and Telecommunications
Meiyu Liang
School of Computer Science, Beijing University of Posts and Telecommunications
Xu Hou
Professor, Xiamen University
Bio-inspired design of advanced materials; chemical modification; biomedical engineering; microfluidics; membrane science
Junping Du
Beijing University of Posts and Telecommunications
Yingxia Shao
School of Computer Science, Beijing University of Posts and Telecommunications
Large-scale Graph Analysis; Graph Data Management; Graph Learning
Guanhua Ye
School of Computer Science, Beijing University of Posts and Telecommunications
Wu Liu
School of Computer Science, Beijing University of Posts and Telecommunications
Kangkang Lu
School of Computer Science, Beijing University of Posts and Telecommunications
Yang Yu
School of Computer Science, Beijing University of Posts and Telecommunications