STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high manual annotation cost and fragmented nature of existing automatic singing annotation (ASA) methods in constructing high-quality singing voice synthesis (SVS) datasets, this paper proposes the first unified ASA framework that jointly performs phoneme alignment, note transcription, vocal technique identification, and global style labeling. We introduce a novel non-autoregressive local acoustic encoder integrated within a hierarchical modeling architecture—spanning frames, phonemes, words, notes, and sentences—to learn structured, multi-granularity acoustic representations. Experiments demonstrate consistent superiority over state-of-the-art ASA methods across all annotation tasks. The fine-grained annotations generated by our framework significantly improve the naturalness and stylistic controllability of downstream SVS models. This work establishes both a high-quality annotated dataset and a robust methodological foundation for controllable singing voice synthesis.

📝 Abstract
Recent breakthroughs in singing voice synthesis (SVS) have heightened the demand for high-quality annotated datasets, yet manual annotation remains prohibitively labor-intensive and resource-intensive. Existing automatic singing annotation (ASA) methods, however, primarily tackle isolated aspects of the annotation pipeline. To address this fundamental challenge, we present STARS, which is, to our knowledge, the first unified framework that simultaneously addresses singing transcription, alignment, and refined style annotation. Our framework delivers comprehensive multi-level annotations encompassing: (1) precise phoneme-audio alignment, (2) robust note transcription and temporal localization, (3) expressive vocal technique identification, and (4) global stylistic characterization including emotion and pace. The proposed architecture employs hierarchical acoustic feature processing across frame, word, phoneme, note, and sentence levels. The novel non-autoregressive local acoustic encoders enable structured hierarchical representation learning. Experimental validation confirms the framework's superior performance across multiple evaluation dimensions compared to existing annotation approaches. Furthermore, applications in SVS training demonstrate that models utilizing STARS-annotated data achieve significantly enhanced perceptual naturalness and precise style control. This work not only overcomes critical scalability challenges in the creation of singing datasets but also pioneers new methodologies for controllable singing voice synthesis. Audio samples are available at https://gwx314.github.io/stars-demo/.
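The abstract describes hierarchical acoustic feature processing across frame, phoneme, word, note, and sentence levels. As a minimal illustrative sketch (not the authors' code: the segment boundaries and mean-pooling below are assumptions; STARS uses learned non-autoregressive encoders, not simple averages), frame-level features could be aggregated bottom-up over segment boundaries like this:

```python
# Illustrative sketch of multi-granularity aggregation: frame-level
# feature vectors are pooled into coarser units (phoneme -> word -> ...)
# using segment boundaries. Boundaries and mean-pooling are hypothetical.

def pool_segments(features, boundaries):
    """Mean-pool feature vectors over [start, end) segments.

    features: list of feature vectors (lists of floats), one per unit.
    boundaries: list of (start, end) index pairs, one per coarser segment.
    """
    pooled = []
    for start, end in boundaries:
        seg = features[start:end]
        dim = len(seg[0])
        pooled.append([sum(v[d] for v in seg) / len(seg) for d in range(dim)])
    return pooled

# Toy example: 6 frames of 2-d features, grouped into phoneme-level and
# then word-level units (all numbers are made up for illustration).
frames = [[1.0, 0.0], [1.0, 0.0], [3.0, 2.0],
          [3.0, 2.0], [5.0, 4.0], [5.0, 4.0]]
phoneme_feats = pool_segments(frames, [(0, 2), (2, 4), (4, 6)])  # 3 phonemes
word_feats = pool_segments(phoneme_feats, [(0, 2), (2, 3)])      # 2 words
```

Each level's representation feeds the next, which is one plausible reading of how a single framework can emit phoneme alignments, note events, and sentence-level style labels from shared acoustic features.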
Problem

Research questions and friction points this paper is trying to address.

Unified framework for singing transcription, alignment, and style annotation
Addresses labor-intensive manual annotation in singing voice synthesis
Enables precise phoneme-audio alignment and identification of expressive vocal techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for singing transcription and alignment
Hierarchical acoustic feature processing across levels
Non-autoregressive local acoustic encoders for structured representation learning
Wenxiang Guo
Zhejiang University
Yu Zhang
Zhejiang University
Changhao Pan
Zhejiang University
Multi-Modal Generative AI, Singing Voice Synthesis
Zhiyuan Zhu
Shanghai Jiao Tong University
NLP, ASR, TTS
Ruiqi Li
Zhejiang University
Zhetao Chen
Zhejiang University
Wenhao Xu
Unknown affiliation
Fei Wu
Zhejiang University
Zhou Zhao
Zhejiang University
Machine Learning, Data Mining, Multimedia Computing