SceneAlign: Aligning Multimodal Reasoning to Scene Graphs in Complex Visual Scenes

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of precise visual grounding in multimodal large language models, which often suffer from hallucinated entities, mis-grounded relations, skipped reasoning steps, or over-specified reasoning during complex visual reasoning. To tackle this, the paper introduces an approach that constructs structured hard-negative reasoning paths via scene graphs, simulating these four types of visual grounding failures through controlled structural interventions. By integrating Direct Preference Optimization (DPO) with explicit supervision over the reasoning process, the method departs from conventional preference alignment paradigms that rely on textual perturbations or answer-conditioned signals, enabling fine-grained, structurally faithful multimodal reasoning. Evaluated across seven visual reasoning benchmarks, the proposed model demonstrates significant improvements in both answer accuracy and reasoning faithfulness, validating the effectiveness of the grounding-aware alignment mechanism.
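The contrastive pairs are used with the standard DPO objective. As a sketch using the usual DPO formulation (notation here is generic, not taken from the paper): with input $x$, a grounded rationale $y^{+}$, and a scene-graph-perturbed hard negative $y^{-}$, the loss is

```latex
\mathcal{L}_{\mathrm{DPO}}
= -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}\!\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_{\theta}(y^{+}\mid x)}{\pi_{\mathrm{ref}}(y^{+}\mid x)}
    - \beta \log \frac{\pi_{\theta}(y^{-}\mid x)}{\pi_{\mathrm{ref}}(y^{-}\mid x)}
  \right)
\right]
```

where $\pi_{\theta}$ is the policy being aligned, $\pi_{\mathrm{ref}}$ a frozen reference model, $\sigma$ the logistic function, and $\beta$ a temperature on the implicit reward margin. What SceneAlign changes is not this objective but how $y^{-}$ is built: structurally, from the scene graph, rather than by textual perturbation.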

📝 Abstract
Multimodal large language models often struggle with faithful reasoning in complex visual scenes, where intricate entities and relations require precise visual grounding at each step. This reasoning unfaithfulness frequently manifests as hallucinated entities, mis-grounded relations, skipped steps, and over-specified reasoning. Existing preference-based approaches, typically relying on textual perturbations or answer-conditioned rationales, fail to address this challenge as they allow models to exploit language priors to bypass visual grounding. To address this, we propose SceneAlign, a framework that leverages scene graphs as structured visual information to perform controllable structural interventions. By identifying reasoning-critical nodes and perturbing them through four targeted strategies that mimic typical grounding failures, SceneAlign constructs hard negative rationales that remain linguistically plausible but are grounded in inaccurate visual facts. These contrastive pairs are used in Direct Preference Optimization to steer models toward fine-grained, structure-faithful reasoning. Across seven visual reasoning benchmarks, SceneAlign consistently improves answer accuracy and reasoning faithfulness, highlighting the effectiveness of grounding-aware alignment for multimodal reasoning.
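The four perturbation strategies described in the abstract can be illustrated with a small sketch. This is a hypothetical reconstruction, not the paper's implementation: the triple format, entity names, and strategy labels are illustrative assumptions.

```python
import random

# Hypothetical sketch of SceneAlign-style hard-negative construction.
# A scene graph is represented as (subject, relation, object) triples;
# this format and the example entities are assumptions for illustration.
scene_graph = [
    ("man", "holding", "umbrella"),
    ("umbrella", "above", "dog"),
    ("dog", "next to", "bench"),
]

def triple_to_step(t):
    """Render one grounded reasoning step from a scene-graph triple."""
    return f"The {t[0]} is {t[1]} the {t[2]}."

def perturb(triples, strategy, rng):
    """Apply one of four grounding-failure interventions to a reasoning-critical triple."""
    triples = list(triples)
    i = rng.randrange(len(triples))
    s, r, o = triples[i]
    if strategy == "entity_hallucination":
        # Replace an entity with one absent from the scene.
        triples[i] = (s, r, "cat")
    elif strategy == "relation_misgrounding":
        # Swap in an incorrect relation between real entities.
        triples[i] = (s, "under", o)
    elif strategy == "step_skipping":
        # Drop a reasoning-critical step entirely.
        del triples[i]
    elif strategy == "over_specification":
        # Append a plausible but unsupported extra step.
        triples.append((o, "near", "tree"))
    return triples

rng = random.Random(0)
chosen = [triple_to_step(t) for t in scene_graph]
rejected = [triple_to_step(t)
            for t in perturb(scene_graph, "entity_hallucination", rng)]
# (chosen, rejected) form one preference pair for DPO training:
# linguistically plausible, but the rejected rationale is grounded
# in an inaccurate visual fact.
```

The point of the design is that each rejected rationale differs from the chosen one by a single controlled structural intervention, so the preference signal isolates the grounding failure rather than surface wording.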
Problem

Research questions and friction points this paper is trying to address.

multimodal reasoning
visual grounding
reasoning faithfulness
scene graphs
hallucination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scene Graph
Visual Grounding
Multimodal Reasoning
Preference Optimization
Structural Intervention