HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Zero-shot human-object interaction (HOI) detection faces two key challenges: poor generalization to unseen action categories and difficulty distinguishing between distinct actions involving the same object. To address these, we propose a low-rank decomposition-based vision-language feature adaptation framework. First, we apply low-rank decomposition to frozen vision-language model (VLM) features to decouple shared bases from learnable weights. Second, we introduce explicit human-object tokens to enhance structural modeling of interaction patterns. Third, we leverage large language model (LLM)-generated action semantics to regularize the action embedding space via semantic constraints. Our approach significantly improves discriminability and generalization for unseen verbs. Under the zero-shot setting on HICO-DET, it achieves state-of-the-art performance, attaining 27.91 mAP on unseen verbs.

📝 Abstract
Zero-shot human-object interaction (HOI) detection remains a challenging task, particularly in generalizing to unseen actions. Existing methods address this challenge by tapping Vision-Language Models (VLMs) to access knowledge beyond the training data. However, they either struggle to distinguish actions involving the same object or demonstrate limited generalization to unseen classes. In this paper, we introduce HOLa (Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation), a novel approach that both enhances generalization to unseen classes and improves action distinction. In training, HOLa decomposes VLM text features for given HOI classes via low-rank factorization, producing class-shared basis features and adaptable weights. These features and weights form a compact HOI representation that preserves shared information across classes, enhancing generalization to unseen classes. Subsequently, we refine action distinction by adapting weights for each HOI class and introducing human-object tokens to enrich visual interaction representations. To further distinguish unseen actions, we guide the weight adaptation with LLM-derived action regularization. Experimental results show that our method sets a new state-of-the-art across zero-shot HOI settings on HICO-DET, achieving an unseen-class mAP of 27.91 in the unseen-verb setting. Our code is available at https://github.com/ChelsieLei/HOLa.
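The abstract's core step, factorizing the VLM text features of the HOI classes into class-shared basis features and adaptable per-class weights, can be sketched with a truncated SVD. This is a minimal illustration, not the paper's implementation: the rank, the 512-dimensional CLIP-style feature size, the class count, and the function name are all assumptions.

```python
import numpy as np

def decompose_text_features(T: np.ndarray, rank: int):
    """Factorize class text features T (C x D) as T ~= W @ B,
    where B (rank x D) holds class-shared basis features and
    W (C x rank) holds adaptable per-class weights.
    Truncated SVD is an assumed choice of factorization."""
    U, S, Vt = np.linalg.svd(T, full_matrices=False)
    W = U[:, :rank] * S[:rank]   # per-class weights, later adapted in training
    B = Vt[:rank]                # basis shared across seen and unseen classes
    return W, B

# Toy stand-in for frozen VLM text features: 600 HOI classes, 512-d embeddings.
rng = np.random.default_rng(0)
T = rng.standard_normal((600, 512))
W, B = decompose_text_features(T, rank=64)
T_hat = W @ B  # compact low-rank representation of all class features
print(W.shape, B.shape, T_hat.shape)
```

Because the basis `B` is shared across classes, an unseen class can be represented by learning only its small weight vector, which is the generalization argument the abstract makes.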
Problem

Research questions and friction points this paper is trying to address.

Enhancing generalization to unseen HOI classes
Improving distinction between similar actions
Adapting VLM features for zero-shot detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-rank decomposed VLM feature adaptation
Human-object tokens enrich visual interaction representations
LLM-derived action regularization for unseen actions
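The last innovation, LLM-derived action regularization, guides the weight adaptation using action semantics generated by an LLM. A plausible form is a cosine-similarity penalty pulling adapted action embeddings toward embeddings of LLM-generated action descriptions; the loss shape, names, and dimensions below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine_reg_loss(adapted: np.ndarray, llm_targets: np.ndarray) -> float:
    """Mean (1 - cosine similarity) between each adapted action embedding
    and its paired LLM-description embedding (an assumed penalty form)."""
    a = adapted / np.linalg.norm(adapted, axis=1, keepdims=True)
    t = llm_targets / np.linalg.norm(llm_targets, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * t, axis=1)))

# Toy stand-ins: 117 verb classes (as in HICO-DET), 512-d embeddings.
rng = np.random.default_rng(0)
emb = rng.standard_normal((117, 512))
loss_aligned = cosine_reg_loss(emb, emb)                          # perfectly aligned pairs
loss_random = cosine_reg_loss(emb, rng.standard_normal((117, 512)))  # unrelated pairs
print(loss_aligned, loss_random)
```

During training such a term would be added to the detection loss so that unseen actions, which have no visual training examples, still receive a semantically meaningful embedding.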