Code-MIE: A Code-style Model for Multimodal Information Extraction with Scene Graph and Entity Attribute Knowledge Enhancement

📅 2026-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing large language model–based multimodal information extraction methods, which rely on natural language templates and struggle to effectively model structured outputs. The study formalizes the task for the first time as a code understanding and generation problem, proposing a unified code-style modeling paradigm. Specifically, it takes as input a Python function that integrates entity attributes, image scene graphs, and raw text, and generates a structured dictionary-format output. By leveraging code-style templates, the approach enhances multimodal semantic alignment and enables unified task modeling. The method achieves state-of-the-art performance across multiple benchmarks, including M³D (English/Chinese F1 scores of 61.03%/60.49%), Twitter-15, Twitter-17, and MNRE, with F1 scores ranging from 73.94% to 88.07%.

📝 Abstract
With the rapid development of large language models (LLMs), more and more researchers have turned their attention to LLM-based information extraction. However, there is still room for improvement in existing methods. First, existing multimodal information extraction (MIE) methods usually employ natural language templates as the input and output of LLMs, which mismatches the characteristics of IE tasks, whose outputs mostly consist of structured information such as entities and relations. Second, although a few methods have adopted structured, more IE-friendly code-style templates, they explored these templates only on text-only IE rather than multimodal IE. Moreover, these methods are more complex in design, requiring a separate template for each task. In this paper, we propose a Code-style Multimodal Information Extraction framework (Code-MIE) that formalizes MIE as unified code understanding and generation. Code-MIE has the following novel designs: (1) Entity attributes such as gender and affiliation are extracted from the text to guide the model in understanding the context and role of entities. (2) Images are converted into scene graphs and visual features to incorporate rich visual information into the model. (3) The input template is constructed as a Python function, where entity attributes, scene graphs, and raw text compose the function parameters. In contrast, the output template is formalized as Python dictionaries containing all extraction results, such as entities and relations. To evaluate Code-MIE, we conducted extensive experiments on the M$^3$D, Twitter-15, Twitter-17, and MNRE datasets. The results show that our method achieves state-of-the-art performance compared to six competing baselines, with F1 scores of 61.03\% and 60.49\% on the English and Chinese subsets of M$^3$D, and 76.04\%, 88.07\%, and 73.94\% on the other three datasets.
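To make the code-style paradigm concrete, the sketch below renders an input template like the one the abstract describes: a Python function whose parameters carry the entity attributes, scene graph, and raw text, with a dictionary as the expected structured output. All names (`build_prompt`, `extract_information`, the parameter names, and the example inputs) are illustrative assumptions, not the paper's exact prompt.

```python
# Hypothetical sketch of a code-style MIE input template (illustrative names,
# not the paper's actual prompt). Entity attributes, the scene graph, and the
# raw text become function parameters; the output is a Python dict.

def build_prompt(text, entity_attributes, scene_graph):
    """Render the code-style input template as a prompt string for the LLM."""
    return f'''def extract_information(
    text: str = {text!r},
    entity_attributes: dict = {entity_attributes!r},
    scene_graph: list = {scene_graph!r},
) -> dict:
    """Extract entities and relations from the multimodal input."""
    results = {{"entities": [], "relations": []}}
    return results
'''

prompt = build_prompt(
    text="Steve Jobs founded Apple in Cupertino.",
    entity_attributes={"Steve Jobs": {"gender": "male", "affiliation": "Apple"}},
    scene_graph=[("person", "standing near", "building")],
)
print(prompt)
```

The LLM would then be asked to complete or return the `results` dictionary, so the structured output can be parsed directly as Python rather than scraped from free-form natural language.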
Problem

Research questions and friction points this paper is trying to address.

Multimodal Information Extraction
Code-style Template
Structured Output
Large Language Models
Entity Attributes
Innovation

Methods, ideas, or system contributions that make the work stand out.

code-style prompting
multimodal information extraction
scene graph
entity attribute enhancement
structured output generation
Jiang Liu
Southern University of Science and Technology
Ophthalmic AI, eye-brain interaction, medical imaging, precision medicine, surgical robotics

Ge Qiu
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University

Hao Fei
National University of Singapore
Vision and Language, Large Language Model, Natural Language Processing, World Modeling

Dongdong Xie
Wuhan Second Ship Design and Research Institute

Jinbo Li
China United Network Communications Co., Ltd. Research Institute

Fei Li
Wuhan University
Natural Language Processing

Chong Teng
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University

Donghong Ji
Wuhan University
Artificial Intelligence, Natural Language Processing