Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs

πŸ“… 2025-12-19
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing HOI detection methods are constrained by the closed-set assumption, limiting generalization to unseen or ambiguous interactions. This work reformulates HOI detection as an open-vocabulary generation task, eliminating reliance on predefined verb vocabularies. The approach introduces three key innovations: (1) a differentiable cognitive-guidance mechanism coupled with a lightweight Cognitive Steering Conduit (CSC) module, bridging frozen multi-modal large language models (MLLMs) and fine-grained visual evidence; (2) a hybrid supervision strategy jointly optimizing language-modeling and classification losses; and (3) efficient, parameter-light adaptation of frozen MLLMs. The method achieves state-of-the-art performance on standard closed-set HOI benchmarks while demonstrating strong zero-shot transfer capability. Notably, it unifies discriminative perception and generative reasoning within a single framework for the first time.


πŸ“ Abstract
Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them. Existing methods operate under a closed-world assumption, treating the task as classification over a small, predefined verb set, which struggles to generalize to the long tail of unseen or ambiguous interactions in the wild. While recent multi-modal large language models (MLLMs) possess the rich world knowledge required for open-vocabulary understanding, they remain decoupled from existing HOI detectors because fine-tuning them is computationally prohibitive. To address these constraints, we propose GRASP-HO, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from a closed-set classification task into an open-vocabulary generation problem. To bridge vision and cognition, we first extract hybrid interaction representations, then design a lightweight learnable Cognitive Steering Conduit (CSC) module to inject fine-grained visual evidence into a frozen MLLM for effective reasoning. To address the supervision mismatch between classification-based HOI datasets and open-vocabulary generative models, we introduce a hybrid guidance strategy that couples the language-modeling loss with an auxiliary classification loss, enabling discriminative grounding without sacrificing generative flexibility. Experiments demonstrate state-of-the-art closed-set performance and strong zero-shot generalization, establishing a unified paradigm that seamlessly bridges discriminative perception and generative reasoning for open-world HOI detection.
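The hybrid guidance strategy described in the abstract can be illustrated with a minimal sketch: a token-level language-modeling (cross-entropy) loss over the generated interaction phrase is combined with an auxiliary classification loss over the known verb set. The tensor shapes, the weighting factor `alpha`, and the multi-label formulation of the auxiliary loss are all assumptions for illustration, not the paper's exact losses.

```python
import numpy as np

def softmax_xent(logits, targets):
    """Token-level cross-entropy; logits: (N, V), targets: (N,) class ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def bce_with_logits(logits, targets):
    """Stable multi-label BCE (HOI pairs may carry several verbs at once)."""
    return np.mean(np.maximum(logits, 0) - logits * targets
                   + np.log1p(np.exp(-np.abs(logits))))

def hybrid_loss(lm_logits, lm_targets, cls_logits, cls_targets, alpha=0.5):
    """Couple the language-modeling loss with an auxiliary classification
    loss; `alpha` is an assumed trade-off weight."""
    return (softmax_xent(lm_logits, lm_targets)
            + alpha * bce_with_logits(cls_logits, cls_targets))
```

The intent, per the abstract, is that the classification term provides discriminative grounding on closed-set verb labels while the generative term keeps the open-vocabulary output space.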
Problem

Research questions and friction points this paper is trying to address.

Detects human-object interactions beyond predefined verb sets
Integrates multi-modal LLMs with HOI detection without fine-tuning
Bridges discriminative perception and generative reasoning for open-world HOI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates HOI detection as open-vocabulary generation
Uses lightweight cognitive steering conduit for frozen MLLM
Introduces hybrid guidance strategy coupling generative and discriminative losses
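The conduit idea in these bullets can be sketched as follows: a small trainable projector maps detector-side visual features into the frozen MLLM's token-embedding space, so only the projector's parameters are learned while the MLLM stays fixed. The two-layer MLP, all dimensions, and the prepend-to-prompt layout are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class SteeringConduit:
    """Lightweight learnable projector (assumed here to be a 2-layer MLP)."""
    def __init__(self, vis_dim, hidden, llm_dim):
        # Only these weights would be trained; the MLLM itself stays frozen.
        self.w1 = rng.standard_normal((vis_dim, hidden)) * 0.02
        self.w2 = rng.standard_normal((hidden, llm_dim)) * 0.02

    def __call__(self, v):
        h = np.maximum(v @ self.w1, 0.0)  # ReLU
        return h @ self.w2                # visual tokens in LLM embedding space

# Assumed dims; real embeddings would come from the frozen MLLM's tokenizer.
conduit = SteeringConduit(vis_dim=256, hidden=128, llm_dim=512)
vis_feats = rng.standard_normal((4, 256))   # 4 interaction-evidence tokens
text_emb = rng.standard_normal((10, 512))   # frozen prompt embeddings
llm_input = np.concatenate([conduit(vis_feats), text_emb], axis=0)
# llm_input: (14, 512) — visual evidence prepended to the text prompt.
```

This mirrors common projector-based MLLM adaptation (soft visual prompting), which appears to be the family the conduit belongs to.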
Zhaolin Cai
Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University
Huiyu Duan
Shanghai Jiao Tong University
Multimedia Signal Processing
Zitong Xu
Shanghai Jiao Tong University
Image Quality Assessment, Image Editing
Fan Li
Xi’an Jiao Tong University
Zhi Liu
Shandong University
Jing Liu
Tianjin University
Wei Shen
Shandong University
Xiongkuo Min
Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays