Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection

📅 2025-07-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Open-vocabulary human-object interaction (HOI) detection requires generalizing to <human, verb, object> triplets that were not seen during training. Existing approaches based on vision-language models (VLMs) rely on coarse-grained, holistic visual features, which limits fine-grained instance localization and relational discrimination. To address this, we propose a bilateral collaboration framework (BC-HOI) that, for the first time, achieves fine-grained cross-modal alignment at both the instance level and the token level. The method uses attention biasing to make the VLM produce discriminative, instance-level interaction features, and uses the large language model (LLM) component of the VLM to provide token-level semantic supervision for the HOI detector, combined with feature disentanglement and supervision transfer mechanisms. Evaluated on HICO-DET and V-COCO, the approach improves open-vocabulary generalization while also achieving superior closed-setting performance, demonstrating the value of fine-grained vision-language alignment for HOI detection.

📝 Abstract
Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that requires detecting all <human, verb, object> triplets of interest in an image, including those that are not pre-defined in the training set. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs) to enhance the generalization ability of interaction representations. However, the visual features produced by VLMs are holistic and coarse-grained, which conflicts with the instance-level nature of detection tasks. To address this issue, we propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI). The framework includes an Attention Bias Guidance (ABG) component, which guides the VLM to produce fine-grained instance-level interaction features according to the attention bias provided by the HOI detector. It also includes a Large Language Model (LLM)-based Supervision Guidance (LSG) component, which uses the LLM component of the VLM to provide fine-grained token-level supervision for the HOI detector. LSG in turn enhances the ability of ABG to generate high-quality attention bias. We conduct extensive experiments on two popular benchmarks, HICO-DET and V-COCO, and consistently achieve superior performance in both the open vocabulary and closed settings. The code will be released on GitHub.
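
The abstract describes ABG only at a high level. As a rough illustration of the general idea of an additive attention bias, the sketch below adds a per-token bias to the attention logits before the softmax, so that tokens flagged by a detector receive more weight. This is not the paper's implementation: the function names, the toy tensors, and the bias value of 2.0 are all assumptions made for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention(query, keys, values, bias):
    """Single-query scaled dot-product attention with an additive bias.

    bias[i] is added to the logit of token i before the softmax. In this
    sketch the bias stands in for a signal from a (hypothetical) HOI
    detector marking tokens that belong to a human-object pair.
    """
    scale = math.sqrt(len(query))
    logits = [
        sum(q * k for q, k in zip(query, key)) / scale + b
        for key, b in zip(keys, bias)
    ]
    weights = softmax(logits)
    dim = len(values[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return pooled, weights

# Toy example (assumed data): four image tokens with identical keys,
# where the detector highlights tokens 1 and 2.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
no_bias = [0.0, 0.0, 0.0, 0.0]
region_bias = [0.0, 2.0, 2.0, 0.0]

uniform_pooled, uniform_weights = biased_attention(query, keys, values, no_bias)
guided_pooled, guided_weights = biased_attention(query, keys, values, region_bias)
```

Without a bias, the identical keys yield uniform attention. With the bias, most of the weight moves to the highlighted tokens, so the pooled feature is dominated by their values rather than by the whole image.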

Problem

Research questions and friction points this paper is trying to address.

Detect open vocabulary human-object interactions accurately
Improve fine-grained interaction features in vision-language models
Enhance attention bias and supervision for HOI detection
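
To make the open vocabulary setting concrete, the sketch below represents HOI triplets as simple records and separates test triplets into those seen during training and those that are unseen. The type name, the helper function, and the example triplets are hypothetical and are not taken from the paper or from the HICO-DET or V-COCO annotations.

```python
from typing import NamedTuple

class HOITriplet(NamedTuple):
    """A <human, verb, object> triplet; field names are illustrative."""
    human: str
    verb: str
    obj: str

def split_seen_unseen(train_triplets, test_triplets):
    """Partition test triplets by whether they also occur in training."""
    seen_set = set(train_triplets)
    seen = [t for t in test_triplets if t in seen_set]
    unseen = [t for t in test_triplets if t not in seen_set]
    return seen, unseen

# Assumed toy data: "ride horse" never appears in the training set.
train_triplets = [
    HOITriplet("human", "ride", "bicycle"),
    HOITriplet("human", "hold", "cup"),
]
test_triplets = [
    HOITriplet("human", "ride", "bicycle"),
    HOITriplet("human", "ride", "horse"),
]

seen, unseen = split_seen_unseen(train_triplets, test_triplets)
```

An open vocabulary detector is expected to handle the triplets in `unseen` as well as those in `seen`, whereas a closed-setting evaluation only involves triplets available at training time.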

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilateral Collaboration framework for HOI detection
Attention Bias Guidance for fine-grained features
LLM-based Supervision Guidance for token-level supervision