Utilizing Vision-Language Models as Action Models for Intent Recognition and Assistance

📅 2025-08-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-time and interpretable understanding of dynamic human intent remains a key challenge in human-robot collaboration. Method: We propose a vision-language-text multimodal intent reasoning framework that jointly fine-tunes a vision-language model (VLM) and a large language model (LLM) to establish semantic priors. Context-aware object and spatial region relevance ranking is achieved via YOLO-based detection, SAM-based segmentation, and a multilayer weighted decision mechanism (integrated within the GUIDER framework). The VLM and LLM jointly serve as semantic filters, enabling task-prompt-driven object selection and adaptive replanning under intent shifts. Contribution/Results: The system is designed to improve intent-recognition accuracy and reduce behavioral response latency, to support end-to-end navigation-and-grasping, and to provide transparent, step-by-step reasoning traces; evaluation in Isaac Sim is planned as future work. It offers a scalable, interpretable pathway toward real-world deployment.

📝 Abstract
Human-robot collaboration requires robots to quickly infer user intent, provide transparent reasoning, and assist users in achieving their goals. Our recent work introduced GUIDER, a framework for inferring navigation and manipulation intents. We propose augmenting GUIDER with a vision-language model (VLM) and a text-only language model (LLM) to form a semantic prior that filters objects and locations based on the mission prompt. A vision pipeline (YOLO for object detection and the Segment Anything Model for instance segmentation) feeds candidate object crops into the VLM, which scores their relevance given an operator prompt; in addition, the list of detected object labels is ranked by a text-only LLM. These scores weight the existing navigation and manipulation layers of GUIDER, selecting context-relevant targets while suppressing unrelated objects. Once the combined belief exceeds a threshold, the autonomy level shifts, enabling the robot to navigate to the desired area and retrieve the desired object while adapting to any changes in the operator's intent. Future work will evaluate the system in Isaac Sim using a Franka Emika arm on a Ridgeback base, with a focus on real-time assistance.
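The score fusion described above can be sketched as follows. This is an illustrative assumption of how VLM crop scores and LLM label rankings might be combined into a normalized belief that gates autonomy; the function names, the keyword-matching placeholders for the two models, the mixing weight `alpha`, and the threshold value are all hypothetical, not the paper's implementation.

```python
# Hedged sketch: fusing VLM and LLM relevance scores into a belief
# over detected objects. Placeholder scorers stand in for real models.

def vlm_relevance(crop_label: str, prompt: str) -> float:
    """Placeholder for the VLM scoring a detected-object crop against
    the operator prompt (the real system scores image crops)."""
    return 1.0 if crop_label in prompt else 0.1

def llm_rank(labels: list[str], prompt: str) -> dict[str, float]:
    """Placeholder for the text-only LLM ranking detected object labels."""
    return {lbl: (1.0 if lbl in prompt else 0.1) for lbl in labels}

def fuse_beliefs(labels: list[str], prompt: str, alpha: float = 0.5) -> dict[str, float]:
    """Convex combination of VLM and LLM scores, normalized to a belief
    distribution; these weights would modulate GUIDER's navigation and
    manipulation layers (alpha is an assumed mixing weight)."""
    llm_scores = llm_rank(labels, prompt)
    fused = {lbl: alpha * vlm_relevance(lbl, prompt) + (1 - alpha) * llm_scores[lbl]
             for lbl in labels}
    total = sum(fused.values())
    return {lbl: s / total for lbl, s in fused.items()}

AUTONOMY_THRESHOLD = 0.6  # assumed value, not from the paper

beliefs = fuse_beliefs(["mug", "book", "stapler"], "bring me the mug")
target, belief = max(beliefs.items(), key=lambda kv: kv[1])
if belief > AUTONOMY_THRESHOLD:
    print(f"autonomy switch: navigate-and-grasp {target}")
```

The normalization step is one plausible way to realize "suppressing unrelated objects": prompt-irrelevant candidates receive low scores and thus little belief mass.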
Problem

Research questions and friction points this paper is trying to address.

Enhance human-robot collaboration through intent recognition
Improve object and location filtering using vision-language models
Enable autonomous navigation and manipulation based on user intent
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language model scores object crops for intent relevance
Text-only LLM ranks detected object labels against the prompt
Combined belief threshold triggers autonomy shifts
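The third point above can be sketched as a running belief that crosses a threshold only under sustained evidence. The exponential-smoothing update, the gain, and the threshold value are illustrative assumptions, not details from the paper.

```python
# Hedged sketch: accumulating per-frame semantic evidence into a belief
# whose threshold crossing triggers an autonomy shift. Constants assumed.

THRESHOLD = 0.75  # assumed autonomy-switch threshold

def update_belief(belief: float, evidence: float, gain: float = 0.3) -> float:
    """Exponentially smooth per-frame relevance evidence (e.g. fused
    VLM/LLM scores) into a running belief, clipped to [0, 1]."""
    return min(1.0, max(0.0, (1.0 - gain) * belief + gain * evidence))

belief = 0.0
for evidence in [0.9, 0.9, 0.95, 0.9, 0.95]:  # consistent intent evidence
    belief = update_belief(belief, evidence)

autonomous = belief > THRESHOLD  # True only after sustained evidence
```

Smoothing rather than reacting to single frames is one way such a system could stay robust to transient detections while still adapting when the operator's intent shifts.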