Vision-Language System using Open-Source LLMs for Gestures in Medical Interpreter Robots

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes what the authors describe as the first privacy-preserving medical interpreting robot framework, designed to restore the nonverbal cues that are lost in cross-lingual healthcare communication. The system integrates speech act recognition with gesture generation, leveraging a locally deployed open-source large language model enhanced by few-shot prompting and vision-language fusion. It achieves 0.90 accuracy and a 0.91 weighted F1 score in identifying speech acts, such as consent or instruction, from doctor–patient dialogues, and produces natural, contextually appropriate gestures. To support this research, the authors introduce the first annotated dataset pairing speech acts with corresponding gestures. User studies show that the generated gestures significantly outperform baseline methods in naturalness while maintaining comparable appropriateness.

📝 Abstract
Effective communication is vital in healthcare, especially across language barriers, where non-verbal cues and gestures are critical. This paper presents a privacy-preserving vision-language framework for medical interpreter robots that detects specific speech acts (consent and instruction) and generates corresponding robotic gestures. Built on locally deployed open-source models, the system utilizes a Large Language Model (LLM) with few-shot prompting for intent detection. We also introduce a novel dataset of clinical conversations annotated for speech acts and paired with gesture clips. Our identification module achieved 0.90 accuracy, 0.93 weighted precision, and a 0.91 weighted F1-Score. Our approach significantly improves computational efficiency and, in user studies, outperforms the speech-gesture generation baseline in human-likeness while maintaining comparable appropriateness.
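The identification module is described as a locally deployed open-source LLM driven by few-shot prompting to classify speech acts (consent and instruction). A minimal sketch of how such a few-shot classifier might be assembled is below; the labels, example dialogue turns, and helper names are illustrative assumptions, not the authors' actual prompt or pipeline.

```python
# Hedged sketch of few-shot speech-act classification with a local LLM.
# The example turns and label set below are assumptions for illustration.

FEW_SHOT_EXAMPLES = [
    ("Doctor: Do you agree to proceed with the injection?",
     "Patient: Yes, that's fine.", "consent"),
    ("Doctor: Please roll up your sleeve for me.",
     "Patient: Okay.", "instruction"),
]

LABELS = ("consent", "instruction", "other")


def build_prompt(utterance: str) -> str:
    """Assemble a few-shot classification prompt for the final utterance."""
    lines = ["Classify the speech act of the final utterance as one of: "
             + ", ".join(LABELS) + "."]
    for doctor_turn, patient_turn, label in FEW_SHOT_EXAMPLES:
        lines += [doctor_turn, patient_turn, f"Speech act: {label}", ""]
    lines += [utterance, "Speech act:"]
    return "\n".join(lines)


def parse_label(completion: str) -> str:
    """Map a raw model completion onto one of the known labels."""
    text = completion.strip().lower()
    for label in LABELS:
        if text.startswith(label):
            return label
    return "other"
```

In practice the assembled prompt would be sent to a locally hosted open-source model (for example via an OpenAI-compatible endpoint exposed by llama.cpp or Ollama; the paper only specifies that the model is open-source and locally deployed), and the detected speech act would then index into the paired gesture clips.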
Problem

Research questions and friction points this paper is trying to address.

medical interpreter robots
gesture generation
speech act detection
vision-language system
non-verbal communication
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language system
open-source LLMs
gesture generation
speech act detection
medical interpreter robot
Thanh-Tung Ngo
Technological University Dublin
Emma Murphy
School of Computer Science, TU Dublin
Human Centred Design · Digital Health · Data Ethics
Robert J. Ross
Technological University Dublin