🤖 AI Summary
This work proposes the first privacy-preserving medical interpreting robot framework, addressing communication barriers in cross-lingual healthcare settings where nonverbal cues are otherwise lost in interpretation. The system integrates speech act recognition with gesture generation, leveraging a locally deployed open-source large language model enhanced by few-shot prompting and vision-language fusion. It achieves 0.90 accuracy and a 0.91 weighted F1 score in identifying speech acts, such as agreement or directives, in doctor-patient dialogues, and produces natural, contextually appropriate gestures. To support this research, the authors introduce the first annotated dataset pairing speech acts with corresponding gesture clips. User studies show that the generated gestures significantly outperform baseline methods in naturalness while maintaining comparable appropriateness.
📝 Abstract
Effective communication is vital in healthcare, especially across language barriers, where nonverbal cues and gestures are critical. This paper presents a privacy-preserving vision-language framework for medical interpreter robots that detects specific speech acts (consent and instruction) and generates corresponding robotic gestures. Built on locally deployed open-source models, the system uses a Large Language Model (LLM) with few-shot prompting for intent detection. We also introduce a novel dataset of clinical conversations annotated for speech acts and paired with gesture clips. Our identification module achieved 0.90 accuracy, 0.93 weighted precision, and a 0.91 weighted F1 score. Our approach significantly improves computational efficiency and, in user studies, outperforms the speech-gesture generation baseline in human-likeness while maintaining comparable appropriateness.
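The abstract's few-shot intent-detection step can be illustrated with a minimal sketch. The label set, example utterances, and prompt wording below are hypothetical illustrations, not taken from the paper; in the described system the assembled prompt would be sent to a locally deployed open-source LLM.

```python
# Hypothetical few-shot prompt construction for speech act detection.
# The labels ("consent", "instruction", "other") and the in-context
# examples are illustrative assumptions, not the paper's actual prompt.
FEW_SHOT_EXAMPLES = [
    ("Yes, I agree to go ahead with the procedure.", "consent"),
    ("Please take this medication twice a day after meals.", "instruction"),
    ("How have you been feeling this week?", "other"),
]

def build_prompt(utterance: str) -> str:
    """Assemble a few-shot classification prompt for a locally
    deployed LLM (e.g. served through an OpenAI-compatible API)."""
    lines = [
        "Classify the speech act of the final utterance as "
        "consent, instruction, or other.",
        "",
    ]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Utterance: {text}")
        lines.append(f"Speech act: {label}")
        lines.append("")
    lines.append(f"Utterance: {utterance}")
    lines.append("Speech act:")
    return "\n".join(lines)
```

Keeping the model and prompt local is what makes the pipeline privacy-preserving: no clinical dialogue leaves the device. The detected label would then index into the paired gesture clips from the dataset.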