Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing conversational systems predominantly focus on text generation, neglecting prosodic expressivity and naturalness in speech output. This work addresses this gap by proposing the first human-like multimodal conversational agent designed for emotionally expressive speech responses. Methodologically, we (1) introduce the first multisensory dialogue dataset integrating linguistic, visual, and acoustic cues; (2) establish a novel speech generation paradigm that jointly models dialogue emotion and response style; and (3) leverage a multimodal large language model to generate text responses enriched with paralinguistic descriptions—explicitly encoding intonation, rhythm, and affect—which drive end-to-end speech synthesis. Experimental results demonstrate that audiovisual modality synergy significantly improves emotional fidelity and naturalness of synthesized speech. User studies confirm superior anthropomorphism and engagement compared to conventional TTS approaches.
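The two-stage pipeline summarized above (a multimodal LLM produces a text response plus a natural-language voice description, which then conditions speech synthesis) can be sketched roughly as follows. This is an illustrative assumption of the interface, not the paper's actual code: the `DialogueTurn` structure, function names, and the placeholder outputs are all hypothetical stand-ins for the real multimodal LLM and TTS models.

```python
from dataclasses import dataclass

@dataclass
class DialogueTurn:
    """Hypothetical container for one multimodal dialogue turn."""
    text: str   # transcript of the partner's utterance
    mood: str   # dialogue-level emotion inferred from audio/visual cues
    style: str  # desired responsive style for the agent

def generate_response(turn: DialogueTurn) -> tuple[str, str]:
    """Stand-in for the multimodal LLM: returns a text response together
    with a voice description encoding intonation, rhythm, and affect."""
    reply = "That sounds wonderful, congratulations!"  # placeholder response
    voice_description = (
        f"A {turn.mood} voice with {turn.style} delivery, "
        "a moderate pace, and rising intonation."
    )
    return reply, voice_description

def synthesize_speech(text: str, voice_description: str) -> bytes:
    """Stand-in for a description-conditioned TTS model; returns a tagged
    byte string here instead of actual audio samples."""
    return f"<audio|{voice_description}|{text}>".encode("utf-8")

# Usage: mood and style drive the paralinguistic side of the output.
turn = DialogueTurn(text="I got the job!", mood="cheerful", style="enthusiastic")
reply, description = generate_response(turn)
audio = synthesize_speech(reply, description)
```

The key design point reflected here is that paralinguistic information travels as a textual description between the two stages, so the LLM's language-modeling ability is reused to control prosody rather than predicting acoustic features directly.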

📝 Abstract
Human conversation involves language, speech, and visual cues, with each medium providing complementary information. For instance, speech conveys a vibe or tone not fully captured by text alone. While multimodal LLMs focus on generating text responses from diverse inputs, less attention has been paid to generating natural and engaging speech. We propose a human-like agent that generates speech responses based on conversation mood and responsive style information. To achieve this, we build a novel MultiSensory Conversation dataset focused on speech to enable agents to generate natural speech. We then propose a multimodal LLM-based model that generates text responses and voice descriptions, which are in turn used to synthesize speech conveying paralinguistic information. Experimental results demonstrate the effectiveness of utilizing both visual and audio modalities in conversation to generate engaging speech. The source code is available at https://github.com/kimtaesu24/MSenC
Problem

Research questions and friction points this paper is trying to address.

Generating natural and engaging speech for conversational agents
Integrating mood and style cues into multimodal speech generation
Addressing lack of paralinguistic information in text-based responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates speech from multimodal conversational inputs
Uses mood and style to create engaging paralinguistic speech
Builds novel MultiSensory dataset for natural speech generation