🤖 AI Summary
This work addresses the challenge of feature fusion in open-vocabulary human-object interaction (HOI) detection, which arises from representation discrepancies between separate models, typically a conventional HOI detector and a Vision-Language Model. The authors propose SL-HOI, a framework built on a single frozen DINOv3 model, introducing only a small number of learnable components. Specifically, SL-HOI exploits the DINOv3 backbone for precise spatial localization and employs its text-aligned vision head for open-vocabulary HOI classification. To bridge the representation gap, interaction queries are processed jointly with the backbone's image tokens through cross-attention inside the vision head. Evaluated on the SWiG-HOI and HICO-DET benchmarks, SL-HOI achieves state-of-the-art performance, demonstrating the effectiveness of its lightweight, end-to-end architecture while maintaining high efficiency.
📝 Abstract
Open-vocabulary human-object interaction (HOI) detection aims to localize and recognize all human-object interactions in an image, including those unseen during training. Existing approaches usually rely on the collaboration between a conventional HOI detector and a Vision-Language Model (VLM) to recognize unseen HOI categories. However, feature fusion in this paradigm is challenging due to significant gaps in cross-model representations. To address this issue, we introduce SL-HOI, a StreamLined open-vocabulary HOI detection framework based solely on the powerful DINOv3 model. Our design leverages the complementary strengths of DINOv3's components: its backbone for fine-grained localization and its text-aligned vision head for open-vocabulary interaction classification. Moreover, to facilitate smooth cross-attention between the interaction queries and the vision head's output, we propose first feeding both the interaction queries and the backbone image tokens into the vision head, effectively bridging their representation gaps. All DINOv3 parameters in our approach are frozen, with only a small number of learnable parameters added, enabling fast adaptation to the HOI detection task. Extensive experiments show that SL-HOI achieves state-of-the-art performance on both the SWiG-HOI and HICO-DET benchmarks, demonstrating the effectiveness of our streamlined model architecture. Code is available at https://github.com/MPI-Lab/SL-HOI.
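The mechanism described above, where learnable interaction queries are fed jointly with frozen backbone image tokens into a text-aligned head and then matched against text embeddings for open-vocabulary classification, can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration: the dimensions are made up, and the "vision head" is reduced to a single cross-attention step with L2 normalization, which is far simpler than the actual SL-HOI or DINOv3 architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions (not the real DINOv3 config).
D = 64          # token / embedding dimension
N_TOKENS = 196  # image tokens from the frozen backbone
N_QUERIES = 8   # learnable interaction queries
N_CLASSES = 5   # candidate open-vocabulary HOI classes

# Stand-ins for the frozen backbone output, the learnable queries,
# and the text embeddings of the HOI class names.
image_tokens = rng.standard_normal((N_TOKENS, D))        # frozen
interaction_queries = rng.standard_normal((N_QUERIES, D))  # learnable
text_embeddings = rng.standard_normal((N_CLASSES, D))

def cross_attention(queries, tokens):
    """Single-head cross-attention: each query attends over all image tokens."""
    scale = 1.0 / np.sqrt(queries.shape[-1])
    attn = softmax(queries @ tokens.T * scale, axis=-1)  # (Q, T)
    return attn @ tokens                                  # (Q, D)

# Feed queries and image tokens jointly through the (stub) head,
# then L2-normalize so classification is by cosine similarity.
updated = cross_attention(interaction_queries, image_tokens)
updated /= np.linalg.norm(updated, axis=-1, keepdims=True)
text = text_embeddings / np.linalg.norm(text_embeddings, axis=-1, keepdims=True)

# Open-vocabulary classification: similarity of each interaction
# query against each class's text embedding.
logits = updated @ text.T  # shape (N_QUERIES, N_CLASSES)
print(logits.shape)
```

In the actual framework the head is DINOv3's pretrained text-aligned vision head and only the queries plus a few added layers are trained, but the data flow, queries and image tokens entering the head together before text matching, is the part this sketch aims to show.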