FreeA: Human-object Interaction Detection using Free Annotation Labels

📅 2024-03-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Human-object interaction (HOI) detection relies heavily on dense, labor-intensive manual annotation. This paper proposes FreeA, a self-adaptive, language-driven framework that removes the need for manually annotated interaction labels. Building on a vision-language pre-trained model (e.g., CLIP), the method aligns image features of human-object pairs with HOI text templates to generate pair-level HOI labels automatically, applies a knowledge-based mask to filter out improbable interactions, and uses interaction correlation matching to refine the generated labels. According to the summary, this is the first end-to-end HOI detector that operates without any human-annotated interaction instances. On HICO-DET and V-COCO, FreeA reports relative mAP gains of 159% and 98% over the newest "Weakly" supervised model, and 28% and 34% over the latest "Weakly+" supervised model, respectively, with more accurate localization and classification of interactions.

📝 Abstract

Recent human-object interaction (HOI) detection methods depend on extensively annotated image datasets, which take substantial manual labor to produce. In this paper, we propose a novel self-adaptive, language-driven HOI detection method, termed FreeA. The method leverages the adaptability of a text-image model to generate latent HOI labels without manual annotation. Specifically, FreeA aligns image features of human-object pairs with HOI text templates and employs a knowledge-based masking technique to suppress improbable interactions. Furthermore, FreeA applies a proposed interaction correlation matching method to raise the likelihood of actions related to a specified action, thereby improving the generated HOI labels. Experiments on two benchmark datasets show that FreeA achieves state-of-the-art performance among weakly supervised HOI competitors. Compared with the newest "Weakly" supervised model, FreeA gains +13.29 mAP (159%↑) on HICO-DET and +17.30 mAP (98%↑) on V-COCO; compared with the latest "Weakly+" supervised model, it gains +7.19 mAP (28%↑) and +14.69 mAP (34%↑), respectively. It also localizes and classifies interactive actions more accurately. The source code will be made public.
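
The label-generation step described in the abstract (aligning pair features with HOI text templates, then masking improbable interactions) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature vectors are toy values standing in for the outputs of a text-image model, and the names `generate_pseudo_labels`, `allowed_verbs`, `threshold`, and `temperature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def generate_pseudo_labels(pair_feature, text_features, allowed_verbs,
                           threshold=0.3, temperature=0.1):
    """Return {verb: probability} for verbs that survive the mask and threshold.

    pair_feature  -- toy image feature of one human-object pair (assumption)
    text_features -- {verb: toy text-template feature} (assumption)
    allowed_verbs -- verbs considered plausible for the detected object;
                     stands in for the knowledge-based mask (assumption)
    """
    # Knowledge-based masking: drop verbs that are implausible for this object.
    candidates = [v for v in text_features if v in allowed_verbs]
    if not candidates:
        return {}

    # Alignment: similarity between the pair feature and each text template.
    sims = [cosine(pair_feature, text_features[v]) for v in candidates]

    # Softmax over the remaining candidates (shifted by the max for stability).
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep only confident interactions as pseudo-labels.
    return {v: p for v, p in zip(candidates, probs) if p >= threshold}


# Toy example: a person next to a bicycle.
text_features = {
    "ride": [0.9, 0.1, 0.0],
    "eat": [1.0, 0.0, 0.2],   # visually similar here, but implausible for a bicycle
    "hold": [0.0, 1.0, 0.0],
}
pair_feature = [1.0, 0.0, 0.2]
labels = generate_pseudo_labels(pair_feature, text_features,
                                allowed_verbs={"ride", "hold"})
```

In this toy case "eat" has the highest raw similarity to the pair feature, yet the mask removes it before scoring, so only "ride" is kept as a pseudo-label. That is the role the abstract assigns to knowledge-based masking.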

Problem

Research questions and friction points this paper is trying to address.

Reduces reliance on manual HOI annotation labor

Improves accuracy in localizing interactive actions

Generates latent HOI labels without human annotation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-driven HOI detection without manual annotation

Knowledge-based masking to filter improbable interactions

Matching interaction correlations to improve action probability
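
One way to picture interaction correlation matching is as a score refinement in which the most confident action raises the scores of actions known to co-occur with it. The summary does not give the update rule, so the additive boost below, the `alpha` weight, and the `correlations` table are illustrative assumptions rather than the paper's formulation.

```python
def refine_with_correlations(scores, correlations, alpha=0.5):
    """Boost actions correlated with the top-scoring (anchor) action.

    scores       -- {verb: score in [0, 1]} for one human-object pair
    correlations -- {(anchor_verb, other_verb): strength in [0, 1]} (hypothetical)
    alpha        -- weight of the boost (hypothetical)
    """
    if not scores:
        return {}
    anchor = max(scores, key=scores.get)
    refined = {}
    for verb, score in scores.items():
        # Pairs missing from the table get no boost; the anchor itself is unchanged
        # unless the table lists a self-correlation.
        strength = correlations.get((anchor, verb), 0.0)
        refined[verb] = min(1.0, score + alpha * strength * scores[anchor])
    return refined


# Toy example: "ride" strongly implies "sit_on" for a bicycle.
scores = {"ride": 0.8, "sit_on": 0.2, "eat": 0.1}
correlations = {("ride", "sit_on"): 0.9}
refined = refine_with_correlations(scores, correlations)
```

Here "sit_on" moves from 0.2 to about 0.56 because it is correlated with the anchor action "ride", while the uncorrelated "eat" keeps its original score.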
