🤖 AI Summary
Speech synthesis research is hindered by a scarcity of high-quality, culturally specific speech corpora, particularly for underrepresented speaker identities such as young female live idols in Japan. This scarcity limits speaker similarity evaluation in TTS and voice conversion (VC) systems and impedes personalized voice modeling driven by listener preferences. To address this, we introduce JIS, a freely distributed speech corpus designed specifically for this demographic, in which each speaker is identified by a stage name. JIS encompasses diverse spoken styles and features fine-grained metadata annotation and contextual cultural documentation. It fills a gap in culturally grounded speech resources. As a benchmark, JIS enables more rigorous similarity assessment in TTS/VC, because listeners familiar with the idols can be recruited for listening experiments, and it facilitates preference-aware voice generation. The corpus is available free of charge for non-commercial, basic research and is accompanied by a basic analysis and guidance on its use.
📝 Abstract
We construct the Japanese Idol Speech Corpus (JIS) to advance research in speech generation AI, including text-to-speech synthesis (TTS) and voice conversion (VC). JIS will facilitate more rigorous evaluations of speaker similarity in TTS and VC systems because all speakers in JIS belong to a highly specific category, "young female live idols" in Japan, and each speaker is identified by a stage name, which enables researchers to recruit listeners familiar with these idols for listening experiments. With its unique speaker attributes, JIS will foster compelling research, including generating voices tailored to listener preferences, an area not yet widely studied. JIS will be distributed free of charge to promote research in speech generation AI, with usage restricted to non-commercial, basic research. We describe the construction of JIS, provide an overview of Japanese live idol culture to support effective and ethical use of JIS, and offer a basic analysis to guide its application.