What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation

📅 2026-02-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenges of open-vocabulary human-object interaction (HOI) understanding in multimodal large language models, which are often hindered by cross-modal hallucinations and occlusion-induced ambiguities. The authors propose ImagineAgent, a novel framework that introduces a generative imagination mechanism to this task for the first time. By constructing an entity-action cognitive map, ImagineAgent integrates cognitive reasoning with tool invocation (retrieval augmentation, image cropping, and diffusion models) to dynamically acquire domain knowledge and visual evidence for cross-modal alignment. Coupled with multimodal reinforcement learning and a composite reward mechanism, the method achieves state-of-the-art performance on SWIG-HOI and HICO-DET using only approximately 20% of the training data, improving both robustness and efficiency.
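
The entity-action structure described above (called a cognitive map in the abstract) can be pictured as a mapping from each detected entity to the candidate actions that are plausible for it. The following is a minimal sketch of that idea only, not the authors' implementation: the function `build_cognitive_map`, the `threshold` parameter, and the toy scoring table are all hypothetical, and in practice the plausibility scores would presumably come from the model itself.

```python
from collections import defaultdict

# Hypothetical plausibility table; a real system would query a model instead.
TOY_SCORES = {
    ("bicycle", "ride"): 0.9,
    ("bicycle", "repair"): 0.6,
    ("bicycle", "eat"): 0.01,
    ("apple", "eat"): 0.95,
}


def toy_score(entity, action):
    """Return an assumed plausibility in [0, 1] for an (entity, action) pair."""
    return TOY_SCORES.get((entity, action), 0.0)


def build_cognitive_map(entities, actions, score, threshold=0.5):
    """Link each entity to the actions whose plausibility meets the threshold.

    Returns a dict: entity -> list of (action, score), most plausible first.
    Entities with no plausible action map to an empty list.
    """
    graph = defaultdict(list)
    for entity in entities:
        for action in actions:
            s = score(entity, action)
            if s >= threshold:
                graph[entity].append((action, s))
        # Accessing graph[entity] also creates the empty list when needed.
        graph[entity].sort(key=lambda pair: pair[1], reverse=True)
    return dict(graph)


cognitive_map = build_cognitive_map(
    ["bicycle", "apple"], ["ride", "repair", "eat"], toy_score
)
```

In this toy setup, implausible pairs such as ("bicycle", "eat") are pruned before any tool is invoked, which is one way such a map could narrow the candidate interactions.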

📝 Abstract

Multimodal Large Language Models have shown promising capabilities in bridging visual and textual reasoning, yet their reasoning in Open-Vocabulary Human-Object Interaction (OV-HOI) is limited by cross-modal hallucinations and occlusion-induced ambiguity. To address this, we propose ImagineAgent, an agentic framework that harmonizes cognitive reasoning with generative imagination for robust visual understanding. Specifically, our method constructs cognitive maps that explicitly model plausible relationships between detected entities and candidate actions. It then dynamically invokes tools, including retrieval augmentation, image cropping, and diffusion models, to gather domain-specific knowledge and enriched visual evidence, thereby achieving cross-modal alignment in ambiguous scenarios. Moreover, we propose a composite reward that balances prediction accuracy and tool efficiency. Evaluations on the SWIG-HOI and HICO-DET datasets demonstrate state-of-the-art performance while requiring approximately 20% of the training data used by existing methods, validating the robustness and efficiency of our approach.
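
The abstract does not give the form of the composite reward, so the following is only an illustrative sketch of how a reward might trade off prediction accuracy against tool usage. Everything here is an assumption: the F1-based accuracy term, the linear per-call penalty, and the names `Episode`, `accuracy_reward`, `composite_reward`, `tool_cost`, and `max_penalty` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    predicted: set      # predicted HOI labels, e.g. {"ride bicycle"}
    ground_truth: set   # annotated HOI labels
    tool_calls: int     # number of tool invocations used by the agent


def accuracy_reward(predicted, ground_truth):
    """Assumed accuracy term: F1 overlap between predicted and true labels."""
    true_positives = len(predicted & ground_truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


def composite_reward(episode, tool_cost=0.05, max_penalty=0.3):
    """Accuracy minus a capped linear penalty on the number of tool calls."""
    penalty = min(tool_cost * episode.tool_calls, max_penalty)
    return accuracy_reward(episode.predicted, episode.ground_truth) - penalty


frugal = Episode({"ride bicycle"}, {"ride bicycle"}, tool_calls=2)
wasteful = Episode({"ride bicycle"}, {"ride bicycle"}, tool_calls=10)
```

Under these assumptions, two equally correct episodes receive different rewards when one uses more tool calls, and the cap keeps a correct but tool-heavy answer from scoring below an incorrect one that uses no tools.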

Problem

Research questions and friction points this paper is trying to address.

Open-Vocabulary HOI

cross-modal hallucinations

occlusion-induced ambiguity

multimodal reasoning

visual understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

ImagineAgent

Open-Vocabulary HOI

Cognitive Map