Conceptualizing Multi-scale Wavelet Attention and Ray-based Encoding for Human-Object Interaction Detection

📅 2025-07-15

📈 Citations: 0

✨ Influential: 0

career value

247K/year

🤖 AI Summary

Existing HOI detection methods suffer from a fundamental trade-off between accuracy and efficiency, primarily due to redundant computation and insufficient modeling capacity for mid-level human-object interactions. To address this, we propose an efficient yet accurate HOI detection framework featuring (i) a multi-scale wavelet attention backbone that enhances fine-grained feature representation at salient regions, and (ii) a learnable-ray-based encoder that jointly incorporates spatial geometric priors and region-aware decoding to achieve compact, discriminative interaction modeling. Our approach achieves state-of-the-art performance on standard benchmarks—including HICO-DET and V-COCO—while reducing computational overhead by 32% compared to prior methods. Crucially, it establishes a superior balance between detection accuracy and inference efficiency, advancing the practical applicability of HOI detection in real-world scenarios.

Technology Category

Application Category

📝 Abstract

Human-object interaction (HOI) detection is essential for accurately localizing and characterizing interactions between humans and objects, providing a comprehensive understanding of complex visual scenes across various domains. However, existing HOI detectors often struggle to deliver reliable predictions efficiently, relying on resource-intensive training methods and inefficient architectures. To address these challenges, we conceptualize a wavelet attention-like backbone and a novel ray-based encoder architecture tailored for HOI detection. Our wavelet backbone addresses the limitations of expressing middle-order interactions by aggregating discriminative features from the low- and high-order interactions extracted from diverse convolutional filters. Concurrently, the ray-based encoder facilitates multi-scale attention by optimizing the focus of the decoder on relevant regions of interest and mitigating computational overhead. As a result of harnessing the attenuated intensity of learnable ray origins, our decoder aligns query embeddings with emphasized regions of interest for accurate predictions. Experimental results on benchmark datasets, including ImageNet and HICO-DET, showcase the potential of our proposed architecture. The code is publicly available at [https://github.com/henry-pay/RayEncoder].

Problem

Research questions and friction points this paper is trying to address.

Improving HOI detection accuracy and efficiency

Addressing middle-order interaction feature limitations

Reducing computational overhead in HOI detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Wavelet attention backbone for feature aggregation

Ray-based encoder for multi-scale attention

Optimized decoder with learnable ray origins

🔎 Similar Papers

A Review of Human-Object Interaction Detection