EchoVoices: Preserving Generational Voices and Memories for Seniors and Children

📅 2025-07-20
🤖 AI Summary
Existing ASR, TTS, and LLM technologies are not robust to atypical speech (e.g., from elderly and child speakers), produce synthesized speech with low naturalness, and maintain dialogue consistency poorly. To address these challenges, this work proposes an end-to-end cross-generational digital human system with three components: (1) a k-NN-augmented Whisper model that improves ASR robustness on atypical speech; (2) an age-adaptive VITS architecture for high-fidelity synthesis that preserves speaker similarity; and (3) an LLM-driven, RAG-based memory system that keeps intergenerational dialogue coherent and personalized. Evaluated on the SeniorTalk and ChildMandarin datasets, the system reports significant improvements in ASR accuracy, MOS speech quality scores, and speaker similarity (SIM). The results support its practicality for digitally preserving voices and passing memories between generations.
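To make the reported metrics more concrete, the following is a minimal sketch of how two of them could be computed. It rests on assumptions the summary does not state: that recognition accuracy is scored as character error rate (CER, a common choice for Mandarin), and that SIM is the cosine similarity between fixed-size speaker embeddings. The function names and the toy inputs are illustrative only.

```python
import math


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical transcript pair with one wrong character out of four.
    print(cer("abcd", "abxd"))
    # Hypothetical speaker embeddings for a reference and a synthesized clip.
    print(cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.8, 0.05]))
```

Lower CER and higher cosine similarity are better. MOS is a human listening score and has no direct code equivalent.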

📝 Abstract
Recent breakthroughs in intelligent speech and digital human technologies have primarily targeted mainstream adult users, often overlooking seniors and children. These demographics possess distinct vocal characteristics, linguistic styles, and interaction patterns that challenge conventional ASR, TTS, and LLM systems. To address this, we introduce EchoVoices, an end-to-end digital human pipeline dedicated to creating persistent digital personas for seniors and children, ensuring their voices and memories are preserved for future generations. Our system integrates three core innovations: a k-NN-enhanced Whisper model for robust speech recognition of atypical speech; an age-adaptive VITS model for high-fidelity, speaker-aware speech synthesis; and an LLM-driven agent that automatically generates persona cards and leverages a RAG-based memory system for conversational consistency. Our experiments, conducted on the SeniorTalk and ChildMandarin datasets, demonstrate significant improvements in recognition accuracy, synthesis quality, and speaker similarity. EchoVoices provides a comprehensive framework for preserving generational voices, offering a new means of intergenerational connection and the creation of lasting digital legacies.
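The abstract does not describe how the RAG-based memory is implemented, so the following is only a toy sketch of the general idea: retrieve the stored memories most relevant to a user turn and place them, together with a persona card, in the prompt given to the LLM. Bag-of-words cosine similarity stands in for a real embedding model, and `retrieve`, `build_prompt`, the persona text, and the memory entries are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a stand-in for a learned text embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, memories: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k memories that share vocabulary with the query."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(m)), m) for m in memories]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [m for score, m in ranked[:top_k] if score > 0]


def build_prompt(persona_card: str, query: str, memories: list[str]) -> str:
    """Assemble persona, retrieved memories, and the user turn into one prompt."""
    recalled = retrieve(query, memories)
    memory_lines = ["- " + m for m in recalled] or ["- (none)"]
    lines = ["Persona: " + persona_card, "Relevant memories:"]
    lines += memory_lines
    lines.append("User: " + query)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical persona card and memory store.
    persona = "Retired teacher, warm and humorous, speaks slowly."
    store = [
        "I grew up near the river and fished with my father",
        "My favourite dish is braised pork",
        "I taught mathematics at a middle school",
    ]
    print(build_prompt(persona, "What did you do with your father by the river?", store))
```

Grounding each reply in retrieved memories rather than the model's free recall is what gives a persona consistent answers across sessions; a production system would replace the word-count vectors with dense embeddings and a vector index.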
Problem

Research questions and friction points this paper is trying to address.

Addresses overlooked vocal patterns of seniors and children
Enhances ASR and TTS for atypical speech and age groups
Preserves generational voices and memories via digital personas
Innovation

Methods, ideas, or system contributions that make the work stand out.

k-NN-enhanced Whisper model for atypical speech recognition
Age-adaptive VITS model for speaker-aware synthesis
LLM-driven agent with RAG-based memory system
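The page does not spell out how k-NN retrieval is combined with Whisper. One common form of k-NN augmentation, shown here purely as an assumed illustration, keeps a datastore of (hidden vector, token) pairs from in-domain speech, turns the nearest neighbours of the current decoder state into a token distribution, and interpolates it with the model's own distribution. The names `knn_distribution` and `interpolate`, the mixing weight `lam`, the temperature, and the two-dimensional toy vectors are all hypothetical.

```python
import math
from collections import defaultdict


def knn_distribution(query, datastore, k=3, temperature=1.0):
    """Turn the k nearest datastore entries into a distribution over tokens."""
    if not datastore:
        raise ValueError("datastore must be non-empty")
    # Squared Euclidean distance from the query to every stored key.
    scored = sorted(
        (sum((q - x) ** 2 for q, x in zip(query, key)), token)
        for key, token in datastore
    )
    weights = defaultdict(float)
    for dist, token in scored[:k]:
        # Closer neighbours contribute more probability mass.
        weights[token] += math.exp(-dist / temperature)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_model, p_knn, lam=0.5):
    """Mix model and k-NN distributions: (1 - lam) * p_model + lam * p_knn."""
    tokens = set(p_model) | set(p_knn)
    return {
        t: (1 - lam) * p_model.get(t, 0.0) + lam * p_knn.get(t, 0.0)
        for t in tokens
    }


if __name__ == "__main__":
    # Toy datastore of (hidden vector, token) pairs from in-domain speech.
    datastore = [
        ((0.0, 0.0), "ma"),
        ((0.1, 0.0), "ma"),
        ((1.0, 1.0), "ba"),
        ((5.0, 5.0), "pa"),
    ]
    p_model = {"ma": 0.4, "ba": 0.6}  # hypothetical base-model prediction
    p_knn = knn_distribution((0.0, 0.1), datastore, k=3)
    mixed = interpolate(p_model, p_knn, lam=0.5)
    print(max(mixed, key=mixed.get))
```

In this toy case the base model slightly prefers "ba", but the nearest stored examples are labelled "ma", so the interpolated distribution shifts toward "ma". The appeal of the approach is that the datastore can be built from elderly or child speech without retraining the recognizer.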
Authors

Haiying Xu, College of Computer Science, Nankai University
Haoze Liu, College of Computer Science, Nankai University
Mingshi Li, College of Computer Science, Nankai University
Siyu Cai, College of Computer Science, Nankai University
Guangxuan Zheng, College of Computer Science, Nankai University
Yuhuang Jia, College of Computer Science, Nankai University
Jinghua Zhao, Nankai University
Yong Qin, College of Computer Science, Nankai University