🤖 AI Summary
This work proposes the first privacy-preserving medical interpreting robot framework, addressing communication barriers in cross-lingual healthcare settings where nonverbal cues are otherwise lost in interpretation. The system integrates speech act recognition with gesture generation, leveraging a locally deployed open-source large language model enhanced by few-shot prompting and vision-language fusion. It achieves 0.90 accuracy and a 0.91 weighted F1 score in identifying speech acts, such as agreement or directives, in doctor-patient dialogues, and produces natural, contextually appropriate gestures. To support this research, the authors introduce the first annotated dataset pairing speech acts with corresponding gesture clips. User studies show that the generated gestures significantly outperform baseline methods in naturalness while maintaining comparable appropriateness.
📝 Abstract
Effective communication is vital in healthcare, especially across language barriers, where nonverbal cues and gestures are critical. This paper presents a privacy-preserving vision-language framework for medical interpreter robots that detects specific speech acts (consent and instruction) and generates corresponding robotic gestures. Built on locally deployed open-source models, the system uses a Large Language Model (LLM) with few-shot prompting for intent detection. We also introduce a novel dataset of clinical conversations annotated for speech acts and paired with gesture clips. Our identification module achieved 0.90 accuracy, 0.93 weighted precision, and a 0.91 weighted F1 score. Our approach significantly improves computational efficiency and, in user studies, outperforms the speech-gesture generation baseline in human-likeness while maintaining comparable appropriateness.
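The abstract's few-shot intent-detection step can be illustrated with a minimal sketch. The label set, example utterances, and prompt wording below are hypothetical illustrations, not taken from the paper; in the described system the assembled prompt would be sent to a locally deployed open-source LLM.

```python
# Hypothetical few-shot prompt construction for speech act detection.
# The labels ("consent", "instruction", "other") and the in-context
# examples are illustrative assumptions, not the paper's actual prompt.
FEW_SHOT_EXAMPLES = [
    ("Yes, I agree to go ahead with the procedure.", "consent"),
    ("Please take this medication twice a day after meals.", "instruction"),
    ("How have you been feeling this week?", "other"),
]

def build_prompt(utterance: str) -> str:
    """Assemble a few-shot classification prompt for a locally
    deployed LLM (e.g. served through an OpenAI-compatible API)."""
    lines = [
        "Classify the speech act of the final utterance as "
        "consent, instruction, or other.",
        "",
    ]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Utterance: {text}")
        lines.append(f"Speech act: {label}")
        lines.append("")
    lines.append(f"Utterance: {utterance}")
    lines.append("Speech act:")
    return "\n".join(lines)
```

Keeping the model and prompt local is what makes the pipeline privacy-preserving: no clinical dialogue leaves the device. The detected label would then index into the paired gesture clips from the dataset.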