Code-MIE: A Code-style Model for Multimodal Information Extraction with Scene Graph and Entity Attribute Knowledge Enhancement

📅 2026-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing large language model–based multimodal information extraction methods, which rely on natural language templates and struggle to effectively model structured outputs. The study formalizes the task for the first time as a code understanding and generation problem, proposing a unified code-style modeling paradigm. Specifically, it takes as input a Python function that integrates entity attributes, image scene graphs, and raw text, and generates a structured dictionary-format output. By leveraging code-style templates, the approach enhances multimodal semantic alignment and enables unified task modeling. The method achieves state-of-the-art performance across multiple benchmarks, including M³D (English/Chinese F1 scores of 61.03%/60.49%), Twitter-15, Twitter-17, and MNRE, with F1 scores ranging from 73.94% to 88.07%.

📝 Abstract
With the rapid development of large language models (LLMs), more and more researchers have turned their attention to LLM-based information extraction. However, there is still room for improvement in existing methods. First, existing multimodal information extraction (MIE) methods usually employ natural language templates as the input and output of LLMs, which mismatches the characteristics of IE tasks, whose outputs mostly consist of structured information such as entities and relations. Second, although a few methods have adopted structured, more IE-friendly code-style templates, they explored these templates only on text-only IE rather than multimodal IE. Moreover, these methods are more complex in design, requiring a separate template for each task. In this paper, we propose a Code-style Multimodal Information Extraction framework (Code-MIE) that formalizes MIE as unified code understanding and generation. Code-MIE has the following novel designs: (1) Entity attributes such as gender and affiliation are extracted from the text to guide the model in understanding the context and role of entities. (2) Images are converted into scene graphs and visual features to incorporate rich visual information into the model. (3) The input template is constructed as a Python function, where entity attributes, scene graphs, and raw text compose the function parameters. In contrast, the output template is formalized as Python dictionaries containing all extraction results, such as entities and relations. To evaluate Code-MIE, we conducted extensive experiments on the M$^3$D, Twitter-15, Twitter-17, and MNRE datasets. The results show that our method achieves state-of-the-art performance compared to six competing baselines, with F1 scores of 61.03\% and 60.49\% on the English and Chinese subsets of M$^3$D, and 76.04\%, 88.07\%, and 73.94\% on the other three datasets.
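To make the code-style paradigm concrete, the sketch below renders an input template like the one the abstract describes: a Python function whose parameters carry the entity attributes, scene graph, and raw text, with a dictionary as the expected structured output. All names (`build_prompt`, `extract_information`, the parameter names, and the example inputs) are illustrative assumptions, not the paper's exact prompt.

```python
# Hypothetical sketch of a code-style MIE input template (illustrative names,
# not the paper's actual prompt). Entity attributes, the scene graph, and the
# raw text become function parameters; the output is a Python dict.

def build_prompt(text, entity_attributes, scene_graph):
    """Render the code-style input template as a prompt string for the LLM."""
    return f'''def extract_information(
    text: str = {text!r},
    entity_attributes: dict = {entity_attributes!r},
    scene_graph: list = {scene_graph!r},
) -> dict:
    """Extract entities and relations from the multimodal input."""
    results = {{"entities": [], "relations": []}}
    return results
'''

prompt = build_prompt(
    text="Steve Jobs founded Apple in Cupertino.",
    entity_attributes={"Steve Jobs": {"gender": "male", "affiliation": "Apple"}},
    scene_graph=[("person", "standing near", "building")],
)
print(prompt)
```

The LLM would then be asked to complete or return the `results` dictionary, so the structured output can be parsed directly as Python rather than scraped from free-form natural language.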
Problem

Research questions and friction points this paper is trying to address.

Multimodal Information Extraction
Code-style Template
Structured Output
Large Language Models
Entity Attributes
Innovation

Methods, ideas, or system contributions that make the work stand out.

code-style prompting
multimodal information extraction
scene graph
entity attribute enhancement
structured output generation
Jiang Liu
Southern University of Science and Technology
Ophthalmic AI, eye-brain interaction, medical imaging, precision medicine, surgical robotics

Ge Qiu
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University

Hao Fei
National University of Singapore
Vision and Language, Large Language Model, Natural Language Processing, World Modeling

Dongdong Xie
Wuhan Second Ship Design and Research Institute

Jinbo Li
China United Network Communications Co., Ltd. Research Institute

Fei Li
Wuhan University
Natural Language Processing

Chong Teng
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University

Donghong Ji
Wuhan University
Artificial Intelligence, Natural Language Processing