🤖 AI Summary
This work proposes Fish Audio S2, an open-source text-to-speech (TTS) system that supports multi-speaker, multi-turn dialogue and enables natural language instruction-based control of speech generation. Addressing the limitations of existing open-source TTS models—which often lack instruction-following capabilities and sophisticated interactive features—the system employs a multi-stage training strategy and a phased data processing pipeline. This pipeline integrates video/audio caption generation, audio quality assessment, and reward modeling. Notably, Fish Audio S2 is the first open-source TTS system to incorporate both natural language-controlled generation and multi-turn, multi-speaker synthesis. The system is equipped with a production-grade SGLang inference engine, enabling streaming generation with a real-time factor (RTF) as low as 0.195 and initial audio latency under 100 milliseconds. Model weights and fine-tuning code are publicly released.
📝 Abstract
We introduce Fish Audio S2, an open-source text-to-speech system featuring multi-speaker, multi-turn generation and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video and speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving an RTF of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We highly encourage readers to visit https://fish.audio to try custom voices.
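As context for the reported numbers: real-time factor (RTF) is conventionally defined as wall-clock generation time divided by the duration of the audio produced, so values below 1 mean faster-than-real-time synthesis. A minimal sketch of the metric (the function name and the example durations are ours, not from the release):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock time spent generating / duration of audio produced.

    RTF < 1.0 means the system synthesizes faster than real time, which is
    what makes streaming playback without buffering stalls feasible.
    """
    return generation_seconds / audio_seconds

# Hypothetical example: 10 s of audio generated in 1.95 s of compute
rtf = real_time_factor(1.95, 10.0)
print(f"RTF = {rtf:.3f}")  # RTF = 0.195
```

At the reported RTF of 0.195, each second of compute yields roughly five seconds of audio; combined with a time-to-first-audio under 100 ms, playback can begin almost immediately while the rest of the utterance is still being generated.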