AI Summary
In DETR-based human-object interaction (HOI) detection, randomly initialized queries suffer from ambiguous representations and insufficient modeling of interaction semantics. To address this, we propose the Dual Query Enhancement Network (DQEN). The method has three parts: (1) object-aware encoder features are explicitly integrated to enhance object queries; (2) an Interaction Semantic Fusion module uses the pretrained CLIP model to generate candidate interaction semantics and injects them into interaction queries; (3) an Auxiliary Prediction Unit refines the interaction feature representation. DQEN thus implements dual-stream query co-enhancement and cross-modal attention fusion within the DETR framework. Evaluated on HICO-Det and V-COCO, DQEN achieves competitive performance, notably improving detection accuracy and generalization for sparse interaction relations.
Abstract
Human-Object Interaction (HOI) detection focuses on localizing human-object pairs and recognizing their interactions. Recently, the DETR-based framework has been widely adopted in HOI detection. In DETR-based HOI models, queries with clear meaning are crucial for accurately detecting HOIs. However, prior works have typically relied on randomly initialized queries, leading to vague representations that limit the model's effectiveness. Moreover, the human category in HOI triplets is fixed, while the objects and their interactions vary. Therefore, we propose a Dual Query Enhancement Network (DQEN) to enhance object and interaction queries. Specifically, object queries are enhanced with object-aware encoder features, enabling the model to focus more effectively on humans who are interacting with objects. In addition, we design a novel Interaction Semantic Fusion module to exploit the HOI candidates produced by the CLIP model. Semantic features are extracted from these candidates to enhance the initialization of interaction queries, thereby improving the model's ability to understand interactions. Furthermore, we introduce an Auxiliary Prediction Unit aimed at improving the representation of interaction features. Our proposed method achieves competitive performance on both the HICO-Det and the V-COCO datasets. The source code is available at https://github.com/lzzhhh1019/DQEN.
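To make the contrast with random query initialization concrete, the following is a minimal, framework-free sketch of initializing interaction queries from candidate semantics. It is not the authors' implementation: the function names `fuse_semantics` and `init_interaction_queries`, the softmax-weighted top-k averaging rule, and the small Gaussian perturbation are all assumptions made for illustration, and the candidate embeddings and scores stand in for whatever features and confidences a CLIP-style model would supply.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_semantics(candidate_embeddings, candidate_scores, top_k):
    """Fuse the top-k HOI candidate embeddings into one semantic vector.

    Assumption: a softmax-weighted average over the k highest-scoring
    candidates. The fusion used in the paper may differ.
    """
    ranked = sorted(
        zip(candidate_scores, candidate_embeddings),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_k]
    weights = softmax([score for score, _ in ranked])
    dim = len(ranked[0][1])
    return [
        sum(w * emb[d] for w, (_, emb) in zip(weights, ranked))
        for d in range(dim)
    ]


def init_interaction_queries(num_queries, semantic_vector, noise_scale=0.01, seed=0):
    """Start every interaction query from the fused semantic vector.

    Assumption: a small Gaussian perturbation keeps the queries distinct,
    in place of a purely random initialization.
    """
    rng = random.Random(seed)
    return [
        [value + rng.gauss(0.0, noise_scale) for value in semantic_vector]
        for _ in range(num_queries)
    ]


# Hypothetical 2-D embeddings and scores for three HOI candidates.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 2.0, -100.0]
fused = fuse_semantics(embeddings, scores, top_k=2)
queries = init_interaction_queries(num_queries=3, semantic_vector=fused)
```

In this sketch every query begins near a semantically meaningful point instead of at an arbitrary random location, which is the intuition the abstract gives for enhancing interaction queries with CLIP-derived features.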