Mario: Multimodal Graph Reasoning with Large Language Models

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem that existing methods for multimodal graph data often neglect the inherent graph structure, and therefore struggle to simultaneously preserve topological fidelity and enable effective reasoning by large language models over heterogeneous signals. To this end, we propose Mario, a novel framework that integrates a graph-conditioned vision-language model, fine-grained cross-modal contrastive learning guided by graph topology, a modality-adaptive graph instruction tuning mechanism, and a learnable routing strategy. Together, these components dynamically select the modality configuration that best supports reasoning. Extensive experiments demonstrate that Mario significantly outperforms current approaches across multiple multimodal graph benchmarks, achieving state-of-the-art results on node classification and link prediction while remaining effective in both supervised and zero-shot settings.
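The summary mentions "fine-grained cross-modal contrastive learning guided by graph topology". The paper does not publish its loss here, but a minimal sketch of what such an objective could look like is an InfoNCE-style text-image loss in which a node's graph neighbors are treated as soft positives; all names, the `alpha` neighbor weight, and the exact weighting scheme below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def info_nce_graph(text_emb, img_emb, adj, tau=0.1, alpha=0.5):
    """Hypothetical topology-guided cross-modal InfoNCE sketch.

    text_emb, img_emb: (N, d) L2-normalized node features per modality.
    adj: (N, N) binary adjacency; neighbors act as soft positives with
    weight `alpha`, so aligned pairs that are also linked in the graph
    pull their representations together.
    """
    # temperature-scaled cosine similarity between all text/image pairs
    logits = text_emb @ img_emb.T / tau                           # (N, N)
    # row-wise log-softmax over candidate images for each text anchor
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positive mass: each node's own image plus its neighbors' images
    pos = np.eye(len(adj)) + alpha * adj
    pos = pos / pos.sum(axis=1, keepdims=True)
    # cross-entropy between the soft positive distribution and log_p
    return float(-(pos * log_p).sum(axis=1).mean())
```

In this sketch the identity term recovers standard per-node contrastive alignment, and the `alpha * adj` term is one simple way topology could "guide" the alignment, as the summary describes.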

📝 Abstract
Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node carries textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning over such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address them, we propose Mario, a unified framework that resolves both challenges and enables effective LLM-based reasoning over MMGs. Mario consists of two stages: first, a graph-conditioned VLM that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology; second, a modality-adaptive graph instruction tuning mechanism that organizes the aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code will be made available at https://github.com/sunyuanfu/Mario.
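The abstract's second stage hinges on a learnable router that picks, per node, the most informative modality configuration. The paper's actual router is not specified here, but the general pattern can be sketched as a per-node linear gate that scores a stack of modality views (e.g. text-only, image-only, neighbor-fused) and returns their softmax-weighted mixture; the class name, gating function, and view set below are all assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class ModalityRouter:
    """Illustrative learnable router over modality views (names assumed).

    Each view is a (N, d) matrix of node representations; a per-view gate
    vector scores how informative that view is for each node, and the
    router emits the softmax-weighted mixture to hand to the LLM side.
    """
    def __init__(self, dim, n_views, seed=0):
        rng = np.random.default_rng(seed)
        # one gate vector per view; trained jointly in the real system
        self.W = rng.normal(scale=dim ** -0.5, size=(dim, n_views))

    def __call__(self, views):
        # views: (V, N, d) stacked modality representations
        scores = np.einsum('vnd,dv->nv', views, self.W)    # (N, V)
        weights = softmax(scores, axis=1)                  # per-node gate
        mixed = np.einsum('nv,vnd->nd', weights, views)    # (N, d)
        return mixed, weights
```

A usage sketch: stacking three hypothetical views of shape `(N, d)` into a `(3, N, d)` array and calling the router yields one `(N, d)` mixture plus per-node weights summing to 1, which is the "modality configuration" each node surfaces.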
Problem

Research questions and friction points this paper is trying to address.

multimodal graph reasoning
cross-modal consistency
heterogeneous modality preference
large language models
graph topology
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal graph reasoning
large language models
graph-conditioned VLM
modality-adaptive instruction tuning
cross-modal contrastive learning