🤖 AI Summary
Existing open-source audio-text datasets lack fine-grained characterization of multidimensional singing attributes (acoustic properties, vocal styles, and demographic factors), which hinders downstream tasks such as style captioning. This paper formally defines the task of *singing style captioning*. We introduce S2Cap, the first multimodal (audio–image–text) benchmark dataset comprehensively annotated with vocal performance, acoustic features, and demographic attributes. To address this task, we propose a lightweight captioning baseline with two components: CRESCENDO, a cross-modal alignment strategy that mitigates the misalignment between unimodally pretrained models, and demixing supervision, which regularizes the model to focus on the singing voice. Evaluated on S2Cap, our method significantly outperforms existing state-of-the-art approaches, validating both the dataset’s utility and the efficacy of our framework. This work establishes a new paradigm for singing voice understanding and generation.
📝 Abstract
Singing voices contain much richer information than common voices, including diverse vocal and acoustic characteristics. However, existing open-source audio-text datasets for singing voices capture only a limited set of attributes and lack acoustic features, which limits their utility for downstream tasks such as style captioning. To fill this gap, we formally consider the task of singing style captioning and introduce S2Cap, a singing voice dataset with comprehensive descriptions of diverse vocal, acoustic, and demographic attributes. Based on this dataset, we develop a simple yet effective baseline algorithm for singing style captioning. The algorithm uses two novel technical components: CRESCENDO, which mitigates the misalignment between pretrained unimodal models, and demixing supervision, which regularizes the model to focus on the singing voice. Despite its simplicity, the proposed method outperforms state-of-the-art baselines.