🤖 AI Summary
This survey addresses human-object interaction (HOI) detection, a core problem in high-level visual understanding that requires localizing the humans and objects involved in an interaction and classifying the interaction between them. It introduces the mainstream benchmark datasets and reviews image-based HOI detection along its two dominant paradigms, two-stage methods and end-to-end one-stage methods, analyzing the strengths and weaknesses of each. It also covers progress in zero-shot learning, weakly supervised learning, and the application of large language models to HOI detection. Finally, it outlines the open challenges in the field and points to future research directions, including scalable interaction modeling, open-vocabulary generalization, and temporally coherent video-based HOI detection.
📝 Abstract
Human-object interaction (HOI) detection plays a key role in high-level visual understanding and supports a deeper comprehension of human activities. Specifically, HOI detection aims to locate the humans and objects involved in interactions within images or videos and to classify those interactions. Success on this task depends on several factors: accurate localization of human and object instances, correct classification of object categories, and correct classification of interaction relationships. This paper systematically summarizes and discusses recent work in image-based HOI detection. First, the mainstream datasets used for HOI detection are introduced. Next, the paper reviews current developments in image-based HOI detection, covering two-stage methods and end-to-end one-stage methods and analyzing the strengths and weaknesses of both. It then discusses advances in zero-shot learning, weakly supervised learning, and the application of large language models to HOI detection. Finally, the current challenges in HOI detection are outlined, and potential research directions and future trends are explored.