Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a controllable singing voice conversion system to address key challenges in the field, including style leakage, unstable dynamic expression, and difficulty achieving high-fidelity synthesis under limited data. The approach incorporates a boundary-aware information bottleneck to suppress residual source-style artifacts, introduces an explicit frame-level technique matrix to enhance dynamic style rendering, and integrates targeted F0 processing with a perception-based high-frequency band completion mechanism to mitigate data scarcity. Evaluated in the SVCC2025 subjective assessment, the system achieves top-ranking naturalness, demonstrates strong speaker similarity and precise control over vocal techniques, and accomplishes these results using significantly less additional singing data than current state-of-the-art systems.
📝 Abstract
This paper presents the submission of the S4 team to the Singing Voice Conversion Challenge 2025 (SVCC2025), a novel singing style conversion system that advances fine-grained style conversion and control within in-domain settings. To address the critical challenges of style leakage, dynamic rendering, and high-fidelity generation with limited data, we introduce three key innovations: a boundary-aware Whisper bottleneck that pools phoneme-span representations to suppress residual source style while preserving linguistic content; an explicit frame-level technique matrix, enhanced by targeted F0 processing during inference, for stable and distinct dynamic style rendering; and a perceptually motivated high-frequency band completion strategy that leverages an auxiliary standard 48 kHz SVC model to augment the high-frequency spectrum, thereby overcoming data scarcity without overfitting. In the official SVCC2025 subjective evaluation, our system achieves the best naturalness among all submissions while maintaining competitive results in speaker similarity and technique control, despite using significantly less extra singing data than other top-performing systems. Audio samples are available online.
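The boundary-aware bottleneck described above pools encoder frames within phoneme spans, discarding frame-level prosodic detail (a carrier of source singing style) while keeping phoneme-level linguistic content. A minimal sketch of that pooling step, assuming mean pooling over aligner-supplied boundaries (the function name, toy data, and boundary format are illustrative, not the authors' implementation):

```python
import numpy as np

def phoneme_span_pool(frames: np.ndarray, boundaries) -> np.ndarray:
    """Mean-pool frame-level features within each phoneme span.

    frames:     (T, D) encoder features, e.g. from a Whisper encoder.
    boundaries: iterable of (start, end) frame indices, one per phoneme,
                end exclusive, e.g. from a forced aligner.
    Returns:    (num_phonemes, D) pooled representations, one vector
                per phoneme instead of per frame.
    """
    return np.stack([frames[s:e].mean(axis=0) for s, e in boundaries])

# Toy example: 10 frames of 4-dim features split into three phoneme spans.
feats = np.arange(40, dtype=np.float32).reshape(10, 4)
spans = [(0, 3), (3, 7), (7, 10)]
pooled = phoneme_span_pool(feats, spans)
print(pooled.shape)  # (3, 4)
```

In practice the pooled vectors would be broadcast back to frame rate (or consumed directly by the decoder), so that within-phoneme timing and pitch variation of the source can no longer leak through the content branch.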
Problem

Research questions and friction points this paper is trying to address.

singing style conversion
style leakage
dynamic rendering
high-fidelity generation
limited data
Innovation

Methods, ideas, or system contributions that make the work stand out.

boundary-aware information bottleneck
frame-level technique matrix
high-frequency band completion
singing voice conversion
F0 processing
Zhetao Hu
School of Software Engineering, Xi’an Jiaotong University, Xi’an, China; SYKI-SPEECH Team, Xi’an, China
Yiquan Zhou
School of Software Engineering, Xi’an Jiaotong University, Xi’an, China; SYKI-SPEECH Team, Xi’an, China
Wenyu Wang
School of Software Engineering, Xi’an Jiaotong University, Xi’an, China; SYKI-SPEECH Team, Xi’an, China
Zhiyu Wu
DeepSeek-AI, Peking University
MLLM · Emotion Recognition · Semi-Supervised Learning
Xin Gao
Division of Music and Audio Union Wheatland Culture and Media Ltd., China
Jihua Zhu
School of Software Engineering, Xi’an Jiaotong University, Xi’an, China; SYKI-SPEECH Team, Xi’an, China