DQEN: Dual Query Enhancement Network for DETR-based HOI Detection

📅 2025-08-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In DETR-based human-object interaction (HOI) detection, randomly initialized queries yield vague representations and model interaction semantics poorly. To address this, the paper proposes the Dual Query Enhancement Network (DQEN), which has three components: (1) object queries are enhanced with object-aware encoder features; (2) an Interaction Semantic Fusion module uses CLIP to obtain HOI candidates and injects their semantic features into the initialization of interaction queries; (3) an Auxiliary Prediction Unit improves the representation of interaction features. Evaluated on HICO-Det and V-COCO, DQEN achieves competitive performance.

📝 Abstract

Human-Object Interaction (HOI) detection focuses on localizing human-object pairs and recognizing their interactions. Recently, the DETR-based framework has been widely adopted in HOI detection. In DETR-based HOI models, queries with clear meaning are crucial for accurately detecting HOIs. However, prior works have typically relied on randomly initialized queries, leading to vague representations that limit the model's effectiveness. Meanwhile, humans in the HOI categories are fixed, while objects and their interactions are variable. Therefore, we propose a Dual Query Enhancement Network (DQEN) to enhance object and interaction queries. Specifically, object queries are enhanced with object-aware encoder features, enabling the model to focus more effectively on humans interacting with objects in an object-aware way. On the other hand, we design a novel Interaction Semantic Fusion module to exploit the HOI candidates that are promoted by the CLIP model. Semantic features are extracted to enhance the initialization of interaction queries, thereby improving the model's ability to understand interactions. Furthermore, we introduce an Auxiliary Prediction Unit aimed at improving the representation of interaction features. Our proposed method achieves competitive performance on both the HICO-Det and the V-COCO datasets. The source code is available at https://github.com/lzzhhh1019/DQEN.
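The abstract does not say how object-aware encoder features are combined with the object queries. As a minimal sketch, assuming each encoder token carries an objectness score and that the top-scoring tokens are simply added to the queries, the idea could look like the following. The function name `enhance_object_queries`, the `objectness` scores, and the additive combination are all hypothetical illustrations, not the authors' implementation.

```python
def enhance_object_queries(queries, encoder_tokens, objectness):
    """Add the most object-like encoder tokens to the object queries.

    queries:        list of N query vectors (each a list of D floats)
    encoder_tokens: list of M encoder feature vectors (D floats each)
    objectness:     list of M scores, higher means more object-like
                    (hypothetical; the paper may score tokens differently)
    """
    n = len(queries)
    # Indices of the N encoder tokens with the highest objectness scores.
    top = sorted(
        range(len(encoder_tokens)),
        key=lambda i: objectness[i],
        reverse=True,
    )[:n]
    # Assumed combination rule: element-wise addition of query and token.
    return [
        [q + f for q, f in zip(query, encoder_tokens[idx])]
        for query, idx in zip(queries, top)
    ]


queries = [[0.0, 0.0], [1.0, 1.0]]
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
scores = [0.1, 0.9, 0.5]
enhanced = enhance_object_queries(queries, tokens, scores)
```

In this toy run the tokens at indices 1 and 2 are selected, so the queries start from image-dependent content instead of from purely random vectors.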

Problem

Research questions and friction points this paper is trying to address.

Enhancing object queries with object-aware features for better focus

Improving interaction query initialization using semantic fusion from CLIP

Boosting interaction feature representation through auxiliary prediction units

Innovation

Methods, ideas, or system contributions that make the work stand out.

Object queries enhanced with object-aware encoder features

Interaction Semantic Fusion module exploits CLIP-promoted HOI candidates

Auxiliary Prediction Unit improves interaction feature representation
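The summary does not describe the internals of the Interaction Semantic Fusion module. One plausible reading, sketched here under stated assumptions, is a single cross-attention step in which each interaction query attends over the semantic embeddings of the HOI candidates and adds the attended context back to itself. The name `fuse_semantics` is hypothetical, and the candidate embeddings are placeholder vectors; in practice they would come from CLIP rather than being written by hand.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_semantics(interaction_queries, candidate_embeddings):
    """Inject candidate semantics into interaction queries.

    interaction_queries:  list of query vectors (D floats each)
    candidate_embeddings: list of candidate HOI embeddings (D floats each),
                          assumed to be precomputed (e.g. by CLIP)
    """
    d = len(candidate_embeddings[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in interaction_queries:
        # Scaled dot-product score between the query and each candidate.
        scores = [
            scale * sum(a * b for a, b in zip(q, e))
            for e in candidate_embeddings
        ]
        weights = softmax(scores)
        # Attention-weighted average of the candidate embeddings.
        context = [
            sum(w * e[k] for w, e in zip(weights, candidate_embeddings))
            for k in range(d)
        ]
        # Residual connection: keep the query and add the semantic context.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


candidates = [[1.0, 0.0], [0.0, 1.0]]
neutral = fuse_semantics([[0.0, 0.0]], candidates)
aligned = fuse_semantics([[10.0, 0.0]], candidates)
```

A query with no preference (`neutral`) receives the average of the candidates, while a query aligned with the first candidate (`aligned`) draws almost all of its context from that candidate.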
