Structured and Abstractive Reasoning on Multi-modal Relational Knowledge Images

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) face significant bottlenecks in understanding Multi-Modal Relational Knowledge (MMRK), i.e., abstract semantic relationships among multi-modal entities structured as node-edge graphs, leaving STructured and Abstractive Reasoning (STAR) over such knowledge largely unexplored. Method: The authors introduce an automated STAR data engine that synthesizes images containing MMRK and builds multi-modal instruction data with reliable chain-of-thought reasoning, together with a two-stage capability-enhancement training framework and evaluation protocols tailored to different STAR tasks. Contribution/Results: They release STAR-64K, a dataset of 64K high-quality multi-modal instruction samples, and conduct experiments across 5 open-source MLLMs. Compact 3B/7B models trained with the framework significantly outperform GPT-4o on STAR tasks, supporting both the effectiveness and the scalability of the approach.
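The summary above says relations are encoded in a node-edge format, with multi-modal entities as nodes and abstract semantic relations as edges. A minimal, hypothetical sketch of such a structure follows; the names (`MMRKGraph`, `Node`, `neighbors`) and fields are my own illustration, not the paper's actual code or schema.

```python
# Hypothetical sketch of Multi-Modal Relational Knowledge (MMRK) as a
# node-edge graph: nodes are multi-modal entities, edges carry abstract
# semantic relations. Not the paper's implementation.
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    modality: str   # e.g. "image" or "text"
    content: str    # e.g. an image path or a text span


@dataclass
class MMRKGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, head: str, relation: str, tail: str) -> None:
        # Abstract semantic relation between two entities.
        self.edges.append((head, relation, tail))

    def neighbors(self, node_id: str) -> list:
        # Entities directly related to node_id, e.g. for one-hop reasoning.
        return [(r, t) for h, r, t in self.edges if h == node_id]


g = MMRKGraph()
g.add_node(Node("e1", "image", "dog.png"))
g.add_node(Node("e2", "text", "mammal"))
g.add_edge("e1", "is_a", "e2")
print(g.neighbors("e1"))  # → [('is_a', 'e2')]
```

A structured view like this is what makes STAR-style tasks (e.g. relation lookup or multi-hop traversal over entities depicted in an image) well defined, as opposed to free-form captioning.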

📝 Abstract
Understanding and reasoning with abstractive information from the visual modality presents significant challenges for current multi-modal large language models (MLLMs). Among the various forms of abstractive information, Multi-Modal Relational Knowledge (MMRK), which represents abstract relational structures between multi-modal entities using node-edge formats, remains largely under-explored. In particular, STructured and Abstractive Reasoning (STAR) on such data has received little attention from the research community. To bridge the dual gaps in large-scale, high-quality data and capability-enhancement methodologies, this paper makes the following key contributions: (i) an automatic STAR data engine capable of synthesizing images with MMRK to build multi-modal instruction data with reliable chain-of-thought reasoning for various STAR tasks, and (ii) a comprehensive two-stage capability-enhancement training framework, accompanied by a suite of evaluation protocols tailored to different STAR tasks. Building upon these contributions, we introduce STAR-64K, a dataset comprising 64K high-quality multi-modal instruction samples, and conduct experiments across 5 open-source MLLMs. Experimental results show that our two-stage enhancement framework enables smaller 3B/7B models to significantly outperform GPT-4o in STAR. Additionally, we provide in-depth analysis regarding the effectiveness of various designs, data transferability, and scalability.
Problem

Research questions and friction points this paper is trying to address.

Addressing abstract reasoning challenges in multi-modal relational knowledge
Developing automatic data synthesis for structured multi-modal instruction
Enhancing small models to outperform large models in structured reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic data engine synthesizes MMRK images
Two-stage training framework enhances reasoning capability
STAR-64K dataset provides 64K instruction samples
Yichi Zhang, College of Computer Science and Technology, Zhejiang University
Zhuo Chen, College of Computer Science and Technology, Zhejiang University
Lingbing Guo, Tianjin University (Machine learning, Artificial Intelligence)
Lei Liang, Ant Group (Knowledge Graph, AI)
Wen Zhang, School of Software Technology, Zhejiang University
Huajun Chen, College of Computer Science and Technology, Zhejiang University