Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Open-vocabulary human-object interaction (HOI) detection requires generalization to unseen interaction categories, yet existing vision-language model (VLM)-based approaches suffer from a mismatch between image-level pretraining and fine-grained region-level interaction modeling, and textual descriptions often fail to capture discriminative visual appearance details. To address this, we propose an interaction-aware prompt generator that dynamically constructs scene-adaptive prompts to facilitate knowledge transfer among semantically similar interactions. We further introduce a language-model-guided concept calibration mechanism to enhance the discriminability of interaction representations. Our framework integrates region-level feature modeling, cross-modal similarity optimization, and hard negative sampling to improve fine-grained relational reasoning and zero-shot generalization. Extensive experiments demonstrate state-of-the-art performance on SWIG-HOI and HICO-DET, validating both effectiveness and strong generalization capability.

📝 Abstract
Open Vocabulary Human-Object Interaction (HOI) detection aims to detect interactions between humans and objects while generalizing to novel interaction classes beyond the training set. Current methods often rely on Vision and Language Models (VLMs) but face challenges due to suboptimal image encoders, as image-level pre-training does not align well with the fine-grained region-level interaction detection required for HOI. Additionally, effectively encoding textual descriptions of visual appearances remains difficult, limiting the model's ability to capture detailed HOI relationships. To address these issues, we propose INteraction-aware Prompting with Concept Calibration (INP-CC), an end-to-end open-vocabulary HOI detector that integrates interaction-aware prompts and concept calibration. Specifically, we propose an interaction-aware prompt generator that dynamically generates a compact set of prompts based on the input scene, enabling selective sharing among similar interactions. This approach directs the model's attention to key interaction patterns rather than generic image-level semantics, enhancing HOI detection. Furthermore, we refine HOI concept representations through language model-guided calibration, which helps distinguish diverse HOI concepts by investigating visual similarities across categories. A negative sampling strategy is also employed to improve inter-modal similarity modeling, enabling the model to better differentiate visually similar but semantically distinct actions. Extensive experimental results demonstrate that INP-CC significantly outperforms state-of-the-art models on the SWIG-HOI and HICO-DET datasets. Code is available at https://github.com/ltttpku/INP-CC.
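The abstract describes a prompt generator that conditions a compact prompt set on the input scene so that semantically similar interactions can share prompts. As a rough illustration of that idea (a minimal NumPy sketch under our own assumptions, not the authors' implementation; the function name, shapes, and top-k selection are illustrative only):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def generate_prompts(scene_feat, prompt_bank, top_k=4):
    """Select and weight a compact, scene-adaptive prompt set.

    scene_feat:  (d,) image-level scene descriptor
    prompt_bank: (n_prompts, d) learned prompt embeddings that can be
                 shared across semantically similar interactions
    Returns a (top_k, d) set of scene-conditioned prompts.
    """
    scores = prompt_bank @ scene_feat        # relevance of each prompt to the scene
    top = np.argsort(scores)[-top_k:]        # keep only the most relevant prompts
    weights = softmax(scores[top])           # normalize into mixing weights
    # Scale each selected prompt by its relevance weight
    return prompt_bank[top] * weights[:, None]
```

In the paper the prompt bank and the selection mechanism are learned end-to-end inside the detector; the sketch above only shows the select-and-weight pattern the abstract describes.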
Problem

Research questions and friction points this paper is trying to address.

Detect human-object interactions for novel classes beyond training data
Improve fine-grained region-level interaction detection in VLMs
Enhance textual encoding for detailed HOI relationship understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic interaction-aware prompt generation
Language-guided HOI concept calibration
Negative sampling for similarity modeling
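The negative sampling idea above — pushing apart actions that look alike but mean different things — can be sketched as mining the wrong categories most similar to a region's visual feature. This is a hypothetical NumPy illustration, not the paper's code; the function name and shapes are assumptions:

```python
import numpy as np

def hard_negatives(region_feat, text_embeds, positive_idx, k=3):
    """Pick the k most confusable wrong HOI categories for a region.

    region_feat:  (d,) visual feature of a human-object region pair
    text_embeds:  (n_classes, d) HOI category text embeddings
    positive_idx: index of the ground-truth category
    Returns indices of visually similar but semantically distinct
    categories, to be pushed apart in a contrastive objective.
    """
    # Cosine similarity between the region and every category embedding
    sims = text_embeds @ region_feat
    sims = sims / (np.linalg.norm(text_embeds, axis=1)
                   * np.linalg.norm(region_feat) + 1e-8)
    sims[positive_idx] = -np.inf             # exclude the true class
    return np.argsort(sims)[-k:][::-1]       # hardest negatives first
```

Training would then penalize high similarity between the region feature and these mined negatives, sharpening the inter-modal similarity model.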