🤖 AI Summary
Existing HOI detection methods suffer from a fundamental trade-off between accuracy and efficiency, primarily due to redundant computation and insufficient modeling capacity for mid-level human-object interactions. To address this, we propose an efficient yet accurate HOI detection framework featuring (i) a multi-scale wavelet attention backbone that enhances fine-grained feature representation at salient regions, and (ii) a learnable-ray-based encoder that jointly incorporates spatial geometric priors and region-aware decoding to achieve compact, discriminative interaction modeling. Our approach achieves state-of-the-art performance on standard benchmarks—including HICO-DET and V-COCO—while reducing computational overhead by 32% compared to prior methods. Crucially, it establishes a superior balance between detection accuracy and inference efficiency, advancing the practical applicability of HOI detection in real-world scenarios.
📝 Abstract
Human-object interaction (HOI) detection is essential for accurately localizing and characterizing interactions between humans and objects, providing a comprehensive understanding of complex visual scenes across various domains. However, existing HOI detectors often struggle to deliver reliable predictions efficiently, relying on resource-intensive training methods and inefficient architectures. To address these challenges, we conceptualize a wavelet attention-like backbone and a novel ray-based encoder architecture tailored for HOI detection. Our wavelet backbone addresses the limitations of expressing middle-order interactions by aggregating discriminative features from the low- and high-order interactions extracted from diverse convolutional filters. Concurrently, the ray-based encoder facilitates multi-scale attention by optimizing the focus of the decoder on relevant regions of interest and mitigating computational overhead. As a result of harnessing the attenuated intensity of learnable ray origins, our decoder aligns query embeddings with emphasized regions of interest for accurate predictions. Experimental results on benchmark datasets, including ImageNet and HICO-DET, showcase the potential of our proposed architecture. The code is publicly available at [https://github.com/henry-pay/RayEncoder].