IntentVLM: Open-Vocabulary Intention Recognition through Forward-Inverse Modeling with Video-Language Models

📅 2026-04-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Social robots struggle to robustly infer human intentions in multimodal settings. To address this challenge, this work proposes IntentVLM, a cognitively inspired, two-stage video-language framework that introduces forward-inverse modeling to open-vocabulary intention recognition for the first time. In the first stage, the model generates candidate goals; in the second, it selects among those candidates through structured reasoning to interpret the intention. This approach mitigates hallucination in implicit reasoning and avoids catastrophic forgetting. Evaluated on the IntentQA and Inst-IT Bench datasets, IntentVLM achieves up to 80% accuracy, surpassing baseline methods by 30% and matching human performance.

📝 Abstract

Improving the effectiveness of human-robot interaction requires social robots to accurately infer human goals through robust intention understanding. This challenge is particularly critical in multimodal settings, where agents must integrate heterogeneous signals, including text and visual cues, to form a coherent interpretation of user intent. This paper presents IntentVLM, a novel two-stage video-language framework designed for open-vocabulary human intention recognition. Inspired by forward-inverse modeling in cognitive science, the approach decomposes intention understanding into goal candidate generation followed by structured inference through selection, which reduces hallucinations in latent reasoning. Evaluated on the IntentQA and Inst-IT Bench datasets, IntentVLM achieves state-of-the-art results with up to 80% accuracy, surpassing the baseline by 30% and matching human performance. Our findings demonstrate that this structured reasoning approach enhances open-vocabulary intention understanding without catastrophic forgetting, offering a robust foundation for human-centered robotics.
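
The two-stage decomposition described in the abstract (candidate goal generation, then selection among the candidates) can be illustrated with a minimal sketch. This is not the authors' implementation: the `VLM` callable interface, the prompts, the helper names, and the `toy_vlm` stand-in are all assumptions made for illustration, and a real system would call an actual video-language model in place of the stub.

```python
import re
from typing import Callable, List

# Assumed interface (hypothetical): a video-language model is any callable
# that takes a video path and a text prompt and returns free-form text.
VLM = Callable[[str, str], str]


def generate_candidates(vlm: VLM, video: str, k: int = 4) -> List[str]:
    """Stage 1: ask the model for up to k open-vocabulary goal hypotheses."""
    prompt = f"List {k} plausible goals of the person in the video, one per line."
    candidates: List[str] = []
    for line in vlm(video, prompt).splitlines():
        goal = line.strip(" -\t")
        if goal and goal not in candidates:  # drop blanks and duplicates
            candidates.append(goal)
    return candidates[:k]


def select_intention(vlm: VLM, video: str, candidates: List[str]) -> str:
    """Stage 2: restrict the answer to the candidates by asking for an index."""
    options = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates))
    prompt = (
        "Which option best explains the person's behaviour? "
        "Answer with the option number only.\n" + options
    )
    match = re.search(r"\d+", vlm(video, prompt))
    index = int(match.group()) if match else 0
    if not 0 <= index < len(candidates):
        index = 0  # fall back to the first candidate on an unusable reply
    return candidates[index]


def recognize_intention(vlm: VLM, video: str, k: int = 4) -> str:
    """Run both stages; the final answer is always one of the candidates."""
    candidates = generate_candidates(vlm, video, k)
    if not candidates:
        raise ValueError("model produced no candidate goals")
    return select_intention(vlm, video, candidates)


def toy_vlm(video: str, prompt: str) -> str:
    """Stand-in model with canned replies, used only to exercise the pipeline."""
    if prompt.startswith("List"):
        return "- pour a drink\n- wash the cup\n- pour a drink\n- hand the cup to someone"
    return "Option 2"


if __name__ == "__main__":
    print(recognize_intention(toy_vlm, "clip.mp4"))
```

Because the second stage returns an index into the stage-one list, the output is constrained to an explicit candidate rather than free-form text, which is one plausible reading of how selection-based inference limits hallucinated answers.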

Problem

Research questions and friction points this paper is trying to address.

intention recognition

open-vocabulary

human-robot interaction

multimodal understanding

video-language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

open-vocabulary intention recognition

forward-inverse modeling

video-language models