SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection

📅 2025-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-vocabulary human-object interaction (OV-HOI) detection methods rely on single-layer CLIP visual features and on coarse-grained interaction descriptions generated by large language models (LLMs). This leads to a loss of object-level detail and to confusion among semantically similar categories. To address this, the authors propose the Stratified Granular Comparison Network (SGC-Net), which has two components: (1) the Granularity Sensing Alignment module fuses multi-layer CLIP visual features to enable fine-grained cross-layer feature alignment; (2) the Hierarchical Group Comparison module guides the LLM to generate discriminative, fine-grained interaction descriptions, mitigating semantic bias. The approach is presented as the first to jointly model multi-layer visual representations, semantic group guidance, and recursive inter-class comparison. Evaluated on SWIG-HOI and HICO-DET, it achieves state-of-the-art performance, with particularly notable improvements in detection accuracy on zero-shot interaction categories.

📝 Abstract
Recent open-vocabulary human-object interaction (OV-HOI) detection methods primarily rely on large language models (LLMs) to generate auxiliary descriptions and leverage knowledge distilled from CLIP to detect unseen interaction categories. Despite their effectiveness, these methods face two challenges: (1) feature granularity deficiency, which arises from relying on last-layer visual features for text alignment and leads to the neglect of crucial object-level details from intermediate layers; (2) semantic similarity confusion, which results from CLIP's inherent biases toward certain classes and from LLM-generated descriptions that are based solely on labels and therefore fail to adequately capture inter-class similarities. To address these challenges, we propose a stratified granular comparison network. First, we introduce a granularity sensing alignment module that aggregates global semantic features with local details, refining interaction representations and ensuring robust alignment between intermediate visual features and text embeddings. Second, we develop a hierarchical group comparison module that recursively compares and groups classes using LLMs, generating fine-grained and discriminative descriptions for each interaction category. Experimental results on two widely used benchmark datasets, SWIG-HOI and HICO-DET, demonstrate that our method achieves state-of-the-art results in OV-HOI detection. Code will be released at https://github.com/Phil0212/SGC-Net.
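To make the first idea concrete, here is a minimal sketch of fusing visual features from several encoder layers before matching them against text embeddings. It is not the paper's granularity sensing alignment module, whose details are not given on this page. The fusion rule (a weighted sum of L2-normalized layer features), the function names `fuse_layers` and `classify`, and the toy vectors are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_layers(layer_features, weights):
    """Weighted sum of L2-normalized per-layer features.

    Assumption: this simple rule stands in for whatever aggregation the
    actual module uses. Every layer feature must have the same dimension.
    """
    if not layer_features or len(layer_features) != len(weights):
        raise ValueError("need one weight per layer feature")
    dim = len(layer_features[0])
    fused = [0.0] * dim
    for feat, weight in zip(layer_features, weights):
        normed = l2_normalize(feat)
        for i in range(dim):
            fused[i] += weight * normed[i]
    return l2_normalize(fused)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def classify(layer_features, weights, text_embeddings):
    """Return the best-matching label and the score for every label."""
    fused = fuse_layers(layer_features, weights)
    scores = {label: cosine(fused, emb) for label, emb in text_embeddings.items()}
    return max(scores, key=scores.get), scores


# Toy data: two "layers" of a 3-dimensional visual feature (hypothetical values).
layers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text = {"ride bicycle": [1.0, 1.0, 0.0], "hold cup": [0.0, 0.0, 1.0]}
label, scores = classify(layers, [0.5, 0.5], text)
print(label)
```

In this toy case neither layer alone points at the "ride bicycle" embedding, but their fusion does, which is the intuition behind combining intermediate and final-layer features.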
Problem

Research questions and friction points this paper is trying to address.

Addresses feature granularity deficiency in OV-HOI detection.
Resolves semantic similarity confusion from CLIP biases.
Improves interaction category descriptions using hierarchical grouping.
Innovation

Methods, ideas, or system contributions that make the work stand out.

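Granularity sensing alignment module aggregates global semantic features with local details from intermediate layers.
Hierarchical group comparison module recursively compares and groups classes to produce discriminative descriptions.
Achieves state-of-the-art results in OV-HOI detection on SWIG-HOI and HICO-DET.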
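The recursive compare-and-group idea can be illustrated with a short sketch. It is not the authors' implementation: the LLM calls are replaced by two plain Python stand-ins, `group_by_object` (groups labels by their last word) and `contrast_within_group` (names the sibling classes a description should be distinguished from). These helpers, the `max_group_size` parameter, and the sample labels are all hypothetical.

```python
def group_by_object(classes):
    """Stand-in for LLM grouping: put labels that share a last word together."""
    groups = {}
    for label in classes:
        groups.setdefault(label.split()[-1], []).append(label)
    return list(groups.values())


def contrast_within_group(classes):
    """Stand-in for LLM comparison: describe each label against its siblings."""
    descriptions = {}
    for label in classes:
        others = [c for c in classes if c != label]
        if others:
            descriptions[label] = f"{label} (as opposed to {', '.join(others)})"
        else:
            descriptions[label] = label
    return descriptions


def hierarchical_compare(classes, group_fn, compare_fn, max_group_size=2):
    """Recursively split classes into groups, then compare within small groups.

    Recursion stops when a group is small enough or cannot be split further,
    so the function always terminates for a grouping that partitions its input.
    """
    if len(classes) <= max_group_size:
        return compare_fn(classes)
    groups = group_fn(classes)
    if len(groups) <= 1:
        return compare_fn(classes)
    descriptions = {}
    for group in groups:
        descriptions.update(
            hierarchical_compare(group, group_fn, compare_fn, max_group_size)
        )
    return descriptions


labels = ["ride bicycle", "repair bicycle", "hold cup", "wash cup", "fly kite"]
result = hierarchical_compare(labels, group_by_object, contrast_within_group)
for label, description in result.items():
    print(f"{label}: {description}")
```

The point of the structure is that each class is contrasted only with the classes it is most likely to be confused with, rather than being described from its label alone.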
Authors

Xin Lin, Guangzhou University
Chong Shi, Guangzhou University
Zuopeng Yang, Shanghai Jiao Tong University (Generative Model, Diffusion Model, AIGC)
Haojin Tang, Guangzhou University
Zhili Zhou, Guangzhou University