🤖 AI Summary
Existing singing voice synthesis systems are constrained by global timbre control, which makes it difficult to model dynamic multi-singer arrangements and rich acoustic textures within a single piece. This work proposes Tutti, a unified framework that enables flexible singer scheduling aligned with musical structure through a structure-aware singer prompting mechanism. It further introduces a condition-guided variational autoencoder (VAE) that learns complementary acoustic textures by jointly leveraging explicit and implicit acoustic features. By moving beyond conventional global timbre settings, the proposed method significantly outperforms existing approaches in both multi-singer scheduling accuracy and the acoustic realism of choral synthesis, establishing a new paradigm for complex polyphonic singing synthesis.
📝 Abstract
While existing singing voice synthesis systems achieve high-fidelity solo performances, they are constrained by global timbre control and fail to address dynamic multi-singer arrangement and vocal texture within a single song. To address this, we propose Tutti, a unified framework designed for structured multi-singer generation. Specifically, we introduce a Structure-Aware Singer Prompt that enables flexible singer scheduling evolving with the musical structure, and propose Complementary Texture Learning via a Condition-Guided VAE to capture implicit acoustic textures (e.g., spatial reverberation and spectral fusion) that complement explicit controls. Experiments demonstrate that Tutti excels at precise multi-singer scheduling and significantly enhances the acoustic realism of choral generation, offering a novel paradigm for complex multi-singer arrangement. Audio samples are available at https://annoauth123-ctrl.github.io/Tutii_Demo/.