Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Zero-shot human-object interaction (HOI) detection faces two core challenges: intra-class visual diversity—where the same verb exhibits large appearance variations due to differing human poses or contextual configurations—and inter-class visual entanglement—where distinct verbs share highly similar visual appearances. To address these, we propose the Visual Diversity-aware and Region-aware Prompt learning framework (VDRP). VDRP explicitly models visual variance via grouped variance estimation and enhances prompt diversity through Gaussian perturbation; it disentangles semantic concepts across human-, object-, and interaction-region features, and integrates contextual embeddings from pre-trained vision-language models (e.g., CLIP) with region-aware retrieval mechanisms. This design significantly improves discriminability for unseen verb–object pairs. Evaluated under four zero-shot settings on HICO-DET, VDRP achieves state-of-the-art performance across all benchmarks, effectively mitigating visual ambiguity in zero-shot HOI detection.

📝 Abstract
Zero-shot Human-Object Interaction detection aims to localize humans and objects in an image and recognize their interaction, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise visual variance into the context embedding. We further apply Gaussian perturbation to encourage the prompts to capture diverse visual variations of a verb. Second, we retrieve region-specific concepts from the human, object, and union regions. These are used to augment the diversity-aware prompt embeddings, yielding region-aware prompts that enhance verb-level discrimination. Experiments on the HICO-DET benchmark demonstrate that our method achieves state-of-the-art performance under four zero-shot evaluation settings, effectively addressing both intra-class diversity and inter-class visual entanglement. Code is available at https://github.com/mlvlab/VDRP.
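The first contribution described above can be sketched in a few lines: estimate per-group variance from a verb's visual features, then perturb the learnable context embedding with Gaussian noise scaled by that variance. This is an illustrative sketch under assumed shapes, not the authors' implementation; the function name, `num_groups`, and `sigma_scale` are hypothetical.

```python
import numpy as np

def diversity_aware_prompts(context_emb, verb_features,
                            num_groups=4, sigma_scale=0.1,
                            num_samples=3, seed=0):
    """Inject group-wise visual variance into a context embedding and
    sample Gaussian-perturbed prompt variants (hypothetical sketch).

    context_emb:   (d,)   learnable context embedding for one verb
    verb_features: (n, d) visual features of that verb's instances
    returns:       (num_samples, d) perturbed prompt embeddings
    """
    rng = np.random.default_rng(seed)
    d = context_emb.shape[0]
    # Split embedding dimensions into groups; estimate one scalar
    # variance per group from the verb's visual features.
    var = np.empty(d)
    for idx in np.array_split(np.arange(d), num_groups):
        var[idx] = verb_features[:, idx].var()
    # Gaussian perturbation scaled by the group-wise standard deviation,
    # so dimensions with more visual diversity get larger noise.
    noise = rng.normal(0.0, 1.0, size=(num_samples, d))
    return context_emb + sigma_scale * np.sqrt(var) * noise
```

Each sampled prompt is a plausible variant of the verb's context, so the text encoder sees (and the prompts learn to cover) the intra-class visual diversity the paper targets.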
Problem

Research questions and friction points this paper is trying to address.

Addressing intra-class visual diversity in human-object interactions
Resolving inter-class visual entanglement between different verbs
Enhancing zero-shot HOI detection with region-aware prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual diversity-aware prompt learning with group variance
Gaussian perturbation captures diverse verb variations
Region-aware prompts from human/object/union concepts
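The third bullet, region-aware prompting, can be sketched as a cosine-similarity retrieval: for each region feature (human, object, union), fetch the nearest concepts from a concept bank and fold them into the prompt embedding. The function name, `concept_bank` matrix, and averaging scheme are assumptions for illustration, not the paper's exact mechanism.

```python
import numpy as np

def region_aware_prompt(prompt_emb, region_feats, concept_bank, top_k=2):
    """Augment a prompt embedding with region-specific concepts
    (hypothetical sketch).

    prompt_emb:   (d,)   diversity-aware prompt embedding
    region_feats: list of (d,) features for human / object / union regions
    concept_bank: (m, d) embeddings of candidate concepts
    returns:      (d,)   region-aware prompt embedding
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    bank = normalize(concept_bank)
    augmented = prompt_emb.copy()
    for feat in region_feats:
        sims = bank @ normalize(feat)      # cosine similarity to each concept
        top = np.argsort(sims)[-top_k:]    # indices of the top-k concepts
        augmented = augmented + concept_bank[top].mean(axis=0)
    # Average the prompt with the retrieved per-region concept means.
    return augmented / (1 + len(region_feats))
```

Because each region contributes its own concepts, prompts for visually entangled verbs (e.g. similar poses but different objects) diverge at the region level, which is the discrimination effect the bullet describes.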