Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning

๐Ÿ“… 2025-05-14
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Traditional scene graphs are limited to modeling spatial relationships, hindering visual-language models (VLMs) from performing complex, generalized functional interaction reasoning among objects. To address this, we propose the Interaction-Enhanced Scene Graph Reasoning (ISGR) frameworkโ€”the first to integrate functional interaction modeling, interaction-focused dual-stream graph construction, and long-term memory-augmented reinforcement learning. ISGR leverages SAM-guided interaction-aware segmentation, functional semantic embedding alignment, targeted interaction query activation, and an interaction-centric reward function to transform learned interaction patterns into long-horizon reasoning heuristics. Evaluated on interaction-intensive benchmarks, ISGR achieves a 12.7% absolute accuracy improvement over strong baselines. This demonstrates both the effectiveness and necessity of memory-enhanced functional interaction modeling for cross-scene generalization in visual reasoning.

Technology Category

Application Category

๐Ÿ“ Abstract
Traditional scene graphs primarily focus on spatial relationships, limiting vision-language models' (VLMs) ability to reason about complex interactions in visual scenes. This paper addresses two key challenges: (1) conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and (2) existing approaches fail to form persistent memories for generalizing interaction reasoning to new scenes. We propose Interaction-augmented Scene Graph Reasoning (ISGR), a framework that enhances VLMs' interactional reasoning through three complementary components. First, our dual-stream graph constructor combines SAM-powered spatial relation extraction with interaction-aware captioning to generate functionally salient scene graphs with spatial grounding. Second, we employ targeted interaction queries to activate VLMs' latent knowledge of object functionalities, converting passive recognition into active reasoning about how objects work together. Finally, we introduce a lone-term memory reinforcement learning strategy with a specialized interaction-focused reward function that transforms transient patterns into long-term reasoning heuristics. Extensive experiments demonstrate that our approach significantly outperforms baseline methods on interaction-heavy reasoning benchmarks, with particularly strong improvements on complex scene understanding tasks. The source code can be accessed at https://github.com/open_upon_acceptance.
Problem

Research questions and friction points this paper is trying to address.

Enhancing VLMs' ability to reason about complex visual interactions
Addressing unfocused relationship sets in scene graph construction
Improving persistent memory for interaction reasoning generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stream graph constructor combines spatial and interaction-aware captioning
Targeted interaction queries activate VLMs' latent functional knowledge
Long-term memory reinforcement learning with interaction-focused rewards
๐Ÿ”Ž Similar Papers
No similar papers found.