Casper: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models

📅 2025-06-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Real-world assisted teleoperation faces significant challenges in real-time human intent understanding due to the diversity of user intentions and the open-ended nature of operational environments; existing approaches are constrained by predefined scenarios or task-specific data distributions. This paper introduces the first teleoperation framework integrating the commonsense reasoning capabilities of vision-language models (VLMs), featuring an open-world perception module and a commonsense-driven intent inference mechanism that enables open-set perception, long-horizon task generalization, and zero-shot intent understanding. Additionally, we design an extensible skill library to support flexible mobile manipulation. Experimental results demonstrate that our method substantially improves task completion rates, reduces user cognitive load, and achieves higher user satisfaction compared to both direct teleoperation and state-of-the-art assisted baselines.

Technology Category

Application Category

📝 Abstract
Assistive teleoperation, where control is shared between a human and a robot, enables efficient and intuitive human-robot collaboration in diverse and unstructured environments. A central challenge in real-world assistive teleoperation is for the robot to infer a wide range of human intentions from user control inputs and to assist users with correct actions. Existing methods are either confined to simple, predefined scenarios or restricted to task-specific data distributions at training, limiting their support for real-world assistance. We introduce Casper, an assistive teleoperation system that leverages commonsense knowledge embedded in pre-trained visual language models (VLMs) for real-time intent inference and flexible skill execution. Casper incorporates an open-world perception module for a generalized understanding of novel objects and scenes, a VLM-powered intent inference mechanism that leverages commonsense reasoning to interpret snippets of teleoperated user input, and a skill library that expands the scope of prior assistive teleoperation systems to support diverse, long-horizon mobile manipulation tasks. Extensive empirical evaluation, including human studies and system ablations, demonstrates that Casper improves task performance, reduces human cognitive load, and achieves higher user satisfaction than direct teleoperation and assistive teleoperation baselines.
Problem

Research questions and friction points this paper is trying to address.

Infer diverse human intentions from teleoperation inputs
Generalize assistance beyond predefined scenarios and tasks
Enable flexible skill execution in unstructured environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses vision language models for intent inference
Open-world perception for novel scenes
Skill library for diverse manipulation tasks
🔎 Similar Papers
No similar papers found.
H
Huihan Liu
The University of Texas at Austin
R
Rutav Shah
The University of Texas at Austin
Shuijing Liu
Shuijing Liu
Postdoc, The University of Texas at Austin
Robot LearningHuman Robot Interaction
J
Jack Pittenger
The University of Texas at Austin
M
Mingyo Seo
The University of Texas at Austin
Y
Yuchen Cui
The University of California, Los Angeles
Yonatan Bisk
Yonatan Bisk
Assistant Professor, Carnegie Mellon University
Natural Language ProcessingEmbodied AIRobot Learning
R
Roberto Mart'in-Mart'in
The University of Texas at Austin
Yuke Zhu
Yuke Zhu
The University of Texas at Austin, NVIDIA Research
Robot LearningComputer VisionMachine LearningRoboticsArtificial Intelligence