🤖 AI Summary
This study addresses the limited intuitiveness and precision of physical human–robot interaction in stiffness teleoperation for remote robotics. We propose a semi-autonomous 3D stiffness ellipsoid control method integrating gaze and speech inputs. Real-time gaze tracking is performed with Tobii Pro Glasses 2, while a GPT-4o-driven vision-language model interprets multimodal commands (speech plus visual context) to enable context-aware stiffness modulation of a KUKA LBR iiwa manipulator teleoperated via a Force Dimension Sigma.7 haptic device. To our knowledge, this is the first work to introduce dual-modal (gaze–speech) input into stiffness teleoperation interfaces, establishing an end-to-end framework for intent understanding and ellipsoid parameter mapping. Experimental results demonstrate that our prompting strategy significantly improves intent recognition accuracy; in a slide-in-the-groove task, the interface supports multi-dimensional control (stiffness center localization, axial scaling, and orientation adjustment), yielding a 23% improvement in task completion efficiency and a 37% increase in subjective intuitiveness ratings.
📝 Abstract
The paper presents a visio-verbal teleimpedance interface for commanding 3D stiffness ellipsoids to a remote robot through a combination of the operator's gaze and verbal interaction. The gaze is detected by an eye-tracker, allowing the system to understand the context, i.e., what the operator is currently looking at in the scene. A Vision-Language Model (VLM) processes this information together with the verbal interaction, enabling the operator to communicate an intended action or provide corrections. Based on these inputs, the interface generates appropriate stiffness matrices for different physical interaction actions. To validate the proposed visio-verbal teleimpedance interface, we conducted a series of experiments on a setup comprising a Force Dimension Sigma.7 haptic device that controls the motion of a remote KUKA LBR iiwa robotic arm. The operator's gaze is tracked by Tobii Pro Glasses 2, while verbal commands are processed by a VLM based on GPT-4o. The first experiment explored the optimal prompt configuration for the interface; the second and third demonstrated different functionalities of the interface on a slide-in-the-groove task.
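To make the ellipsoid-to-stiffness mapping concrete, the sketch below shows one common way such an interface can turn ellipsoid parameters (per-axis stiffness magnitudes and an orientation) into a Cartesian stiffness matrix via K = R · diag(k) · Rᵀ. This is a minimal illustration under assumed conventions (ZYX Euler angles, N/m units); the function name and parameterization are not taken from the paper.

```python
import numpy as np

def stiffness_from_ellipsoid(axis_stiffness, rpy):
    """Illustrative mapping from stiffness-ellipsoid parameters to a
    3x3 Cartesian stiffness matrix.

    axis_stiffness: stiffness along the ellipsoid's principal axes [N/m]
    rpy: (roll, pitch, yaw) orientation of the ellipsoid [rad]
    Returns K = R @ diag(k) @ R.T (symmetric positive definite).
    """
    roll, pitch, yaw = rpy
    cx, sx = np.cos(roll), np.sin(roll)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cz, sz = np.cos(yaw), np.sin(yaw)
    # Elementary rotations about x, y, z; composed in ZYX order.
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    R = Rz @ Ry @ Rx
    # Eigen-decomposition form: principal stiffnesses rotated into the
    # task frame. "Axial scaling" changes axis_stiffness; "orientation
    # adjustment" changes rpy.
    return R @ np.diag(axis_stiffness) @ R.T

# Example: compliant along z (e.g. for sliding into a groove), stiff in x/y.
K = stiffness_from_ellipsoid([500.0, 500.0, 100.0], (0.0, 0.0, 0.0))
```

In this formulation, the verbal command interpreted by the VLM only needs to produce the low-dimensional ellipsoid parameters (axis stiffnesses and orientation), which are then expanded into the full matrix sent to the impedance controller.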