🤖 AI Summary
Existing conversational systems focus predominantly on text generation, neglecting prosodic expressivity and naturalness in speech output. This work addresses that gap by proposing the first human-like multimodal conversational agent designed for emotionally expressive speech responses. Methodologically, we (1) introduce the first multisensory dialogue dataset integrating linguistic, visual, and acoustic cues; (2) establish a novel speech generation paradigm that jointly models dialogue emotion and response style; and (3) leverage a multimodal large language model to generate text responses enriched with paralinguistic descriptions (explicitly encoding intonation, rhythm, and affect), which drive end-to-end speech synthesis. Experimental results demonstrate that audiovisual modality synergy significantly improves the emotional fidelity and naturalness of synthesized speech, and user studies confirm greater anthropomorphism and engagement than conventional TTS approaches.
📝 Abstract
Human conversation involves language, speech, and visual cues, with each medium providing complementary information. For instance, speech conveys a vibe or tone not fully captured by text alone. While multimodal LLMs focus on generating text responses from diverse inputs, less attention has been paid to generating natural and engaging speech. We propose a human-like agent that generates speech responses based on conversation mood and response style information. To achieve this, we build a novel MultiSensory Conversation dataset focused on speech, enabling agents to generate natural speech. We then propose a multimodal LLM-based model that generates text responses together with voice descriptions, which are used to synthesize speech conveying paralinguistic information. Experimental results demonstrate the effectiveness of utilizing both visual and audio modalities in conversation to generate engaging speech. The source code is available at https://github.com/kimtaesu24/MSenC
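The two-stage design described above (an LLM emits a text response plus a voice description, which then conditions speech synthesis) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's released implementation: the output format, function names, and style-tag convention are all assumptions made for the example.

```python
# Hypothetical sketch of description-conditioned speech generation.
# Stage 1: an LLM produces a text response plus a paralinguistic
# voice description. Stage 2: both are combined into a prompt for a
# description-conditioned TTS model. The "Response: ... | Voice: ..."
# format and "[style: ...]" tag are illustrative, not from the paper.

def parse_llm_output(raw: str) -> dict:
    """Split a hypothetical LLM output of the form
    'Response: <text> | Voice: <description>' into its parts."""
    text_part, _, voice_part = raw.partition("| Voice:")
    return {
        "text": text_part.replace("Response:", "").strip(),
        "voice": voice_part.strip(),
    }

def build_tts_prompt(text: str, voice: str) -> str:
    """Prefix the response text with its voice description so a
    description-conditioned TTS model could realize intonation,
    rhythm, and affect (illustrative format)."""
    return f"[style: {voice}] {text}"

raw = "Response: That's wonderful news! | Voice: bright, fast-paced, rising intonation"
parts = parse_llm_output(raw)
print(build_tts_prompt(parts["text"], parts["voice"]))
# → [style: bright, fast-paced, rising intonation] That's wonderful news!
```

In practice, the final prompt would be handed to a speech synthesizer rather than printed; the point is that prosody is carried by the generated description rather than inferred from text alone.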