🤖 AI Summary
To address the heavy reliance of human-object interaction (HOI) detection on dense, labor-intensive manual annotations, this paper proposes a language-driven weakly supervised adaptation framework that eliminates the need for explicit interaction labels. Leveraging vision-language pre-trained models (e.g., CLIP), the method introduces a knowledge-guided masking mechanism and an interaction-aware correlation matching strategy; it automatically generates pair-level HOI labels by aligning human-object pair features with textual templates and exploiting semantic correlations between actions. To the best of our knowledge, this is the first end-to-end HOI detector operating entirely without any human-annotated interaction instances. On HICO-DET and V-COCO, respectively, the approach achieves relative mAP gains of 159% and 98% over the newest "Weakly" supervised model, and 28% and 34% over the latest "Weakly+" supervised model, with more accurate localization and classification of interactions.
📝 Abstract
Recent human-object interaction (HOI) detection methods depend on extensively annotated image datasets, which are labor-intensive to produce. In this paper, we propose a novel self-adaptive, language-driven HOI detection method, termed FreeA. FreeA leverages the adaptability of a text-image model to generate latent HOI labels without requiring manual annotation. Specifically, FreeA aligns image features of human-object pairs with HOI text templates and employs a knowledge-based masking technique to suppress improbable interactions. Furthermore, FreeA applies a proposed interaction correlation matching method that raises the likelihood of actions correlated with a given action, thereby refining the generated HOI labels. Experiments on two benchmark datasets show that FreeA achieves state-of-the-art performance among weakly supervised HOI detectors. On HICO-DET and V-COCO, respectively, FreeA gains +13.29 (159%↑) and +17.30 (98%↑) mAP over the newest "Weakly" supervised model, and +7.19 (28%↑) and +14.69 (34%↑) mAP over the latest "Weakly+" supervised model, while localizing and classifying interactive actions more accurately. The source code will be made public.
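The three label-generation steps named in the abstract (template alignment, knowledge-based masking, correlation matching) can be illustrated with a minimal sketch. This is not the FreeA implementation: the function name `pseudo_hoi_labels`, the softmax temperature, the additive boost weighted by `alpha`, the final threshold, and the toy embeddings are all assumptions made for illustration, and plain Python lists stand in for features that would normally come from a text-image model such as CLIP.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def pseudo_hoi_labels(pair_feat, pair_object, text_feats, hoi_objects,
                      corr, temperature=0.1, alpha=1.0, threshold=0.5):
    """Return one boolean pseudo-label per HOI class for a single human-object pair.

    All parameters are hypothetical: corr[j][i] is an assumed prior in [0, 1]
    describing how strongly HOI class j suggests HOI class i.
    """
    k = len(text_feats)
    # 1) Align the pair's image feature with every HOI text-template embedding.
    sims = [cosine(pair_feat, t) for t in text_feats]
    # 2) Knowledge-based mask: zero out classes whose object differs from the detected one.
    mask = [1.0 if hoi_objects[i] == pair_object else 0.0 for i in range(k)]
    weights = [math.exp(s / temperature) * m for s, m in zip(sims, mask)]
    total = sum(weights)
    probs = [w / total if total > 0 else 0.0 for w in weights]
    # 3) Correlation matching: raise the score of classes correlated with likely ones.
    boosted = []
    for i in range(k):
        support = sum(probs[j] * corr[j][i] for j in range(k))
        boosted.append(min(1.0, probs[i] + alpha * support) * mask[i])
    # 4) Keep classes whose boosted score clears the (assumed) threshold.
    return [b >= threshold for b in boosted]

# Toy setup with three HOI classes: ride bicycle, hold bicycle, ride horse.
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
hoi_objects = ["bicycle", "bicycle", "horse"]
# Assumed prior: "ride bicycle" strongly suggests "hold bicycle".
corr = [[0.0, 0.8, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
no_corr = [[0.0] * 3 for _ in range(3)]

labels = pseudo_hoi_labels([1.0, 0.0], "bicycle", text_feats, hoi_objects, corr)
labels_no_corr = pseudo_hoi_labels([1.0, 0.0], "bicycle", text_feats, hoi_objects, no_corr)
print(labels)          # [True, True, False]
print(labels_no_corr)  # [True, False, False]
```

In this toy case the mask removes "ride horse" even though its template embedding matches the pair feature exactly, and the correlation prior is what promotes "hold bicycle" to a positive label. A real pipeline would batch these operations over all detected pairs in an image.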