Taking A Closer Look at Interacting Objects: Interaction-Aware Open Vocabulary Scene Graph Generation

📅 2025-02-06
🤖 AI Summary
Existing open-vocabulary scene graph generation (OVSGG) methods neglect explicit modeling of object interactions, leading to relation misalignment and poor generalization to unseen categories. To address this, we propose INOVA, the first interaction-aware framework for OVSGG. INOVA introduces three key innovations: (1) an interaction-discriminative object proposal generation strategy to mitigate label ambiguity; (2) an interaction-guided bipartite matching mechanism to improve relation alignment accuracy; and (3) interaction-consistent knowledge distillation to strengthen cross-modal semantic alignment. Built upon CLIP/ViL for open-vocabulary reasoning, INOVA adopts a two-stage training paradigm: weakly supervised pretraining followed by supervised fine-tuning. Evaluated on the Visual Genome (VG) and GQA benchmarks, INOVA achieves state-of-the-art performance, significantly improving interaction relation recognition (+4.2% Recall@100) and generalization to open-world categories.
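
For readers unfamiliar with the Recall@K metric quoted above, the following is a minimal sketch of how it is commonly computed for scene graph triplets. It is a simplification and not the paper's evaluation code: it assumes triplets are matched by label only, whereas standard SGG evaluation also requires the predicted boxes to overlap the ground-truth boxes. The function name `recall_at_k` and the toy data are hypothetical.

```python
def recall_at_k(predicted, ground_truth, k):
    """Fraction of ground-truth triplets found among the top-k predictions.

    predicted:    list of (score, (subject, predicate, object)) tuples
    ground_truth: set of (subject, predicate, object) tuples
    Assumption: label-only matching (no bounding-box IoU check).
    """
    if not ground_truth:
        return 0.0
    # Rank predictions by confidence, highest first.
    ranked = sorted(predicted, key=lambda item: item[0], reverse=True)
    top_k = {triplet for _, triplet in ranked[:k]}
    hits = len(top_k & set(ground_truth))
    return hits / len(ground_truth)


# Toy example (made-up triplets and scores).
predicted = [
    (0.9, ("person", "riding", "horse")),
    (0.8, ("person", "wearing", "hat")),
    (0.3, ("horse", "on", "grass")),
    (0.2, ("hat", "on", "horse")),
]
ground_truth = {("person", "riding", "horse"), ("horse", "on", "grass")}

print(recall_at_k(predicted, ground_truth, 2))  # 0.5
print(recall_at_k(predicted, ground_truth, 3))  # 1.0
```

With K = 2 only one of the two ground-truth triplets is retrieved; raising K to 3 recovers both.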

📝 Abstract
Today's open vocabulary scene graph generation (OVSGG) extends traditional SGG by recognizing novel objects and relationships beyond predefined categories, leveraging the knowledge from pre-trained large-scale models. Most existing methods adopt a two-stage pipeline: weakly supervised pre-training with image captions and supervised fine-tuning (SFT) on fully annotated scene graphs. Nonetheless, they omit explicit modeling of interacting objects and treat all objects equally, resulting in mismatched relation pairs. To this end, we propose an interaction-aware OVSGG framework INOVA. During pre-training, INOVA employs an interaction-aware target generation strategy to distinguish interacting objects from non-interacting ones. In SFT, INOVA devises an interaction-guided query selection tactic to prioritize interacting objects during bipartite graph matching. Besides, INOVA is equipped with an interaction-consistent knowledge distillation to enhance the robustness by pushing interacting object pairs away from the background. Extensive experiments on two benchmarks (VG and GQA) show that INOVA achieves state-of-the-art performance, demonstrating the potential of interaction-aware mechanisms for real-world applications.
Problem

Research questions and friction points this paper is trying to address.

Interaction-aware Open Vocabulary Scene Graph Generation
Explicit modeling of interacting objects
State-of-the-art performance on benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interaction-aware target generation strategy
Interaction-guided query selection tactic
Interaction-consistent knowledge distillation
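
The idea of prioritizing interacting objects during bipartite matching can be illustrated with a small sketch. This is not INOVA's implementation, whose details are not given in this summary. It assumes each object query carries a scalar interaction score, that "selection" means keeping the top-k queries by that score, and that matching minimizes a given cost matrix. The names `select_queries` and `match` are hypothetical, and a brute-force search stands in for the Hungarian algorithm to keep the example dependency-free.

```python
from itertools import permutations


def select_queries(interaction_scores, k):
    """Return indices of the k queries with the highest interaction score.

    Assumption: one scalar score per query (hypothetical).
    """
    order = sorted(
        range(len(interaction_scores)),
        key=lambda i: interaction_scores[i],
        reverse=True,
    )
    return sorted(order[:k])


def match(cost, query_ids, num_targets):
    """Minimum-cost one-to-one assignment of targets to the given queries.

    cost[q][t] is the cost of assigning query q to target t.
    Brute force over permutations; only suitable for tiny examples.
    """
    if len(query_ids) < num_targets:
        raise ValueError("need at least as many queries as targets")
    best_cost, best = float("inf"), None
    for perm in permutations(query_ids, num_targets):
        total = sum(cost[q][t] for t, q in enumerate(perm))
        if total < best_cost:
            best_cost, best = total, perm
    return {t: q for t, q in enumerate(best)}, best_cost


# Toy example: 4 queries, 2 ground-truth targets (all numbers made up).
interaction_scores = [0.1, 0.9, 0.8, 0.2]
cost = [
    [0.1, 0.9],  # query 0: cheap for target 0, but low interaction score
    [0.3, 0.8],  # query 1
    [0.7, 0.2],  # query 2
    [0.9, 0.1],  # query 3: cheap for target 1, but low interaction score
]

selected = select_queries(interaction_scores, 2)
guided, guided_cost = match(cost, selected, 2)
plain, plain_cost = match(cost, range(4), 2)

print(selected)  # [1, 2]
print(guided)    # {0: 1, 1: 2}
print(plain)     # {0: 0, 1: 3}
```

In this toy setup, plain matching assigns the targets to queries 0 and 3 because they are cheapest, while the guided variant restricts matching to queries 1 and 2, the ones scored as interacting.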
Authors

Lin Li, The Hong Kong University of Science and Technology, Hong Kong
Chuhan Zhang, Hong Kong University of Science and Technology (computer vision)
Dong Zhang, The Hong Kong University of Science and Technology, Hong Kong
Chong Sun, Tencent WeChat (computer vision)
Chen Li, Tencent, China
Long Chen, The Hong Kong University of Science and Technology, Hong Kong