📝 Abstract
Socially assistive robots (SARs) have shown great potential for supplementing well-being support. However, prior studies have found that existing dialogue pipelines for SARs remain limited in real-time latency, back-channeling, and personalized speech dialogue. Toward addressing these limitations, we propose integrating end-to-end speech-language models (SLMs) with SARs. This work 1) evaluated the usability of an SLM-enabled SAR dialogue system through a small user study, and 2) identified remaining limitations from participant feedback to inform future improvements. We conducted a small within-participants user study with university students (N = 11); the results showed that participants perceived the SLM-enabled SAR system as capable of providing empathetic feedback, natural turn-taking, back-channeling, and adaptive responses. We also found that participants reported the robot's nonverbal behaviors as lacking variability and synchronization with the conversation, and the SLM's verbal feedback as generic and repetitive. These findings highlight the need for real-time robot movement synchronized with conversation, improved prompting or fine-tuning to generate outputs better aligned with mental health practices, and more expressive, adaptive vocal generation.