🤖 AI Summary
Multimodal large language models (MLLMs) face significant bottlenecks in understanding multimodal abstract relational knowledge (MMRK)—i.e., abstract semantic relationships among multimodal entities, structured as node-edge graphs—leaving STructured and Abstractive Reasoning (STAR) over such knowledge a largely unexplored task.
Method: We introduce STAR as the first systematic benchmark, accompanied by an automated STAR data engine and a two-stage capability-enhancement training framework that jointly supports MMRK modeling, generation, and reasoning. Our approach explicitly encodes multimodal relations in node-edge format, integrating synthetic data generation, multimodal instruction tuning, and a customized evaluation protocol.
Contribution/Results: We release STAR-64K, the first large-scale STAR dataset (64K samples). Experiments show that even compact 3B/7B-parameter models trained with our framework substantially outperform GPT-4o on STAR tasks, validating both the effectiveness and scalability of structured abstract reasoning as a paradigm.
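The paper's central data structure is MMRK: abstract relations among multimodal entities encoded in node-edge format. As a rough illustration only (the class and field names below are hypothetical, not taken from the paper or its released code), such a graph might be sketched as:

```python
# Illustrative sketch of an MMRK-style node-edge graph.
# All names/fields here are assumptions for illustration, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    modality: str   # e.g. "image_region" or "text"
    content: str    # entity name, caption, or a reference to an image crop

@dataclass
class MMRKGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src, relation, dst) triples

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        # Both endpoints must already exist as nodes.
        assert src in self.nodes and dst in self.nodes
        self.edges.append((src, relation, dst))

    def neighbors(self, node_id: str):
        # Outgoing (relation, target) pairs for one node.
        return [(rel, dst) for src, rel, dst in self.edges if src == node_id]

g = MMRKGraph()
g.add_node(Node("n1", "image_region", "dog"))
g.add_node(Node("n2", "text", "mammal"))
g.add_edge("n1", "is_a", "n2")
print(g.neighbors("n1"))  # [('is_a', 'n2')]
```

A STAR task would then ask the model to generate or reason over such graphs grounded in a synthesized image, rather than over free-form captions.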
📝 Abstract
Understanding and reasoning with abstractive information from the visual modality presents significant challenges for current multi-modal large language models (MLLMs). Among the various forms of abstractive information, Multi-Modal Relational Knowledge (MMRK), which represents abstract relational structures between multi-modal entities using node-edge formats, remains largely under-explored. In particular, STructured and Abstractive Reasoning (STAR) on such data has received little attention from the research community. To bridge the dual gaps in large-scale high-quality data and capability-enhancement methodologies, this paper makes the following key contributions: (i) an automatic STAR data engine capable of synthesizing images with MMRK to build multi-modal instruction data with reliable chain-of-thought reasoning for various STAR tasks, and (ii) a comprehensive two-stage capability-enhancement training framework, accompanied by a suite of evaluation protocols tailored to different STAR tasks. Building on these contributions, we introduce STAR-64K, a dataset comprising 64K high-quality multi-modal instruction samples, and conduct experiments across 5 open-source MLLMs. Experimental results show that our two-stage enhancement framework enables smaller 3B/7B models to significantly outperform GPT-4o in STAR. Additionally, we provide in-depth analysis regarding the effectiveness of various designs, data transferability, and scalability.