FreeA: Human-object Interaction Detection using Free Annotation Labels

📅 2024-03-04
📈 Citations: 0
Influential: 0

🤖 AI Summary
To reduce the heavy reliance of human-object interaction (HOI) detection on dense, labor-intensive manual annotation, this paper proposes FreeA, a self-adaptive, language-driven framework that generates HOI labels without manually annotated interaction labels. Using a vision-language pre-trained model (e.g., CLIP), the method aligns image features of human-object pairs with HOI text templates, applies knowledge-based masking to suppress improbable interactions, and uses interaction correlation matching to refine the generated pair-level labels. The method is presented as the first end-to-end HOI detector that operates without any human-annotated interaction instances. On HICO-DET and V-COCO, FreeA reports relative mAP gains of 159% and 98% over the newest "Weakly" supervised model, and 28% and 34% over the latest "Weakly+" supervised model, respectively, with more accurate localization and classification of interactions.

📝 Abstract
Recent human-object interaction (HOI) detection methods depend on extensively annotated image datasets, which require a significant amount of manpower. In this paper, we propose a novel self-adaptive, language-driven HOI detection method, termed FreeA. This method leverages the adaptability of the text-image model to generate latent HOI labels without requiring manual annotation. Specifically, FreeA aligns image features of human-object pairs with HOI text templates and employs a knowledge-based masking technique to suppress improbable interactions. Furthermore, FreeA uses an interaction correlation matching method to raise the likelihood of actions related to a specified action, thereby refining the generated HOI labels. Experiments on two benchmark datasets show that FreeA achieves state-of-the-art performance among weakly supervised HOI competitors. FreeA gains +13.29 (159%↑) mAP and +17.30 (98%↑) mAP over the newest "Weakly" supervised model, and +7.19 (28%↑) mAP and +14.69 (34%↑) mAP over the latest "Weakly+" supervised model, on HICO-DET and V-COCO respectively, and it localizes and classifies interactive actions more accurately. The source code will be made public.
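
The alignment and masking steps described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the function name `candidate_hoi_labels`, the `plausible` prior mask, and the `temperature` and `threshold` values are all assumptions made for illustration. It scores each human-object pair embedding against each HOI text-template embedding by cosine similarity, removes template entries that prior knowledge marks as implausible, and thresholds the remaining probabilities to obtain candidate labels.

```python
import numpy as np

def candidate_hoi_labels(pair_feats, text_feats, plausible,
                         temperature=0.07, threshold=0.3):
    """Illustrative pseudo-labelling step (hypothetical, not the paper's code).

    pair_feats: (P, D) image embeddings, one per human-object pair.
    text_feats: (K, D) text embeddings, one per HOI template.
    plausible:  (P, K) booleans, True where the interaction is considered
                possible for that pair (assumed prior knowledge).
    Returns (probs, labels), both of shape (P, K).
    """
    pair_feats = np.asarray(pair_feats, dtype=float)
    text_feats = np.asarray(text_feats, dtype=float)
    plausible = np.asarray(plausible, dtype=bool)
    if not plausible.any(axis=1).all():
        raise ValueError("each pair needs at least one plausible interaction")

    # L2-normalise so the dot product is a cosine similarity.
    pair = pair_feats / np.linalg.norm(pair_feats, axis=1, keepdims=True)
    text = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sim = pair @ text.T  # (P, K)

    # Masked entries get -inf so they receive zero probability.
    logits = np.where(plausible, sim / temperature, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)

    # Keep every template whose probability clears the threshold.
    return probs, probs >= threshold
```

In practice the two embedding matrices would come from the image and text encoders of a text-image model; here they are plain arrays so the sketch stays self-contained.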
Problem

Research questions and friction points this paper is trying to address.

Reduces reliance on manual HOI annotation labor
Improves accuracy in localizing interactive actions
Generates latent HOI labels without human annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-driven HOI detection without manual annotation
Knowledge-based masking to filter improbable interactions
Matching interaction correlations to improve action probability
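
One way to picture the third point is a score adjustment in which a confidently predicted action raises the scores of actions that tend to co-occur with it. The sketch below is an assumption-laden illustration, not the paper's formula: the function `boost_correlated_actions`, the correlation matrix `corr`, and the weight `alpha` are all hypothetical.

```python
import numpy as np

def boost_correlated_actions(probs, corr, alpha=0.5):
    """Raise scores of actions correlated with each pair's top action.

    probs: (P, A) action probabilities per human-object pair.
    corr:  (A, A) hypothetical correlation strengths in [0, 1],
           with a zero diagonal so an action does not boost itself.
    alpha: hypothetical weight controlling the size of the boost.
    """
    probs = np.asarray(probs, dtype=float)
    corr = np.asarray(corr, dtype=float)

    top = probs.argmax(axis=1)                     # most likely action per pair
    top_score = probs[np.arange(len(probs)), top]  # its probability

    # The boost is proportional to the top action's confidence and to
    # how strongly each other action correlates with it.
    boost = alpha * top_score[:, None] * corr[top]
    return np.clip(probs + boost, 0.0, 1.0)
```

For example, with actions ordered as (ride, sit on, eat) and a strong assumed correlation between "ride" and "sit on", a pair scored highly for "ride" would also see its "sit on" score increase, while "eat" stays unchanged.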
🔎 Similar Papers
2024-08-20
2024 2nd International Conference on Computer, Vision and Intelligent Technology (ICCVIT)
Citations: 2
Qi Liu
School of Future Technology, South China University of Technology, China 511400
Yuxiao Wang
School of Future Technology, South China University of Technology, China 511400
Xinyu Jiang
School of Future Technology, South China University of Technology, China 511400
Wolin Liang
Zhenao Wei
School of Future Technology, South China University of Technology, China 511400
Yu Lei
School of Information Science & Technology, Southwest Jiaotong University, China 611730
Zhuang Nan
Weiying Xue
School of Future Technology, South China University of Technology, China 511400