Serenade: A Singing Style Conversion Framework Based On Audio Infilling

📅 2025-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three core challenges in singing style conversion (SSC): inadequate target-style modeling, insufficient source-style disentanglement, and loss of melodic fidelity. The authors propose an audio mask-filling paradigm for style modeling: a flow-matching model predicts a masked segment of the target mel-spectrogram from its unmasked complement together with disentangled acoustic features; cycle-consistent training is used to disentangle the source style; and an F0-aware post-processing module resynthesizes the converted waveform with a source-filter vocoder guided by the original pitch contour, improving pitch accuracy and naturalness. Experiments show state-of-the-art overall similarity on generalized SSC tasks, notably better conversion of difficult vocal qualities (e.g., breathy and mixed voice), and reduced out-of-tune singing, at the cost of a slight drop in style similarity because the F0 patterns are not converted to the target style.

📝 Abstract
We propose Serenade, a novel framework for the singing style conversion (SSC) task. Although singer identity conversion has made great strides in recent years, converting the singing style of a singer has remained a largely unexplored research area. We find three main challenges in SSC: modeling the target style, disentangling the source style, and retaining the source melody. To model the target singing style, we use an audio infilling task, predicting a masked segment of the target mel-spectrogram with a flow-matching model conditioned on the complement of the masked target mel-spectrogram along with disentangled acoustic features. To disentangle the source singing style, we use a cyclic training approach, where synthetic converted samples serve as source inputs and the original source mel-spectrogram is reconstructed as the target. Finally, to better retain the source melody, we investigate a post-processing module using a source-filter-based vocoder that resynthesizes the converted waveforms using the original F0 patterns. Our results show that the Serenade framework can handle generalized SSC tasks with the best overall similarity score, especially in modeling breathy and mixed singing styles. Moreover, although resynthesizing with the original F0 patterns alleviated out-of-tune singing and improved naturalness, we found a slight trade-off in similarity because the F0 patterns are not converted into the target style.
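The masked-infilling setup described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: `mask_segment` is a hypothetical helper, the mel-spectrogram is a plain list of frames, and masking is simple zeroing; the real model would be trained to predict the zeroed span from the unmasked complement plus disentangled acoustic features.

```python
import random

def mask_segment(mel, mask_ratio=0.3, seed=0):
    """Mask a contiguous span of mel frames for an infilling task.

    mel: list of frames, each frame a list of mel-bin values.
    Returns (masked_mel, (start, end)) where frames in [start, end)
    are zeroed out and the rest form the conditioning complement.
    Toy stand-in for Serenade's audio-infilling setup, not its code.
    """
    rng = random.Random(seed)
    n = len(mel)
    span_len = max(1, int(n * mask_ratio))
    start = rng.randrange(0, n - span_len + 1)
    end = start + span_len
    masked = [
        [0.0] * len(frame) if start <= i < end else list(frame)
        for i, frame in enumerate(mel)
    ]
    return masked, (start, end)

mel = [[float(i)] * 4 for i in range(10)]  # 10 frames, 4 mel bins
masked, (s, e) = mask_segment(mel, mask_ratio=0.3)
```

A flow-matching model would then receive `masked` (the context) and learn to generate the missing frames `mel[s:e]`.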
Problem

Research questions and friction points this paper is trying to address.

Convert singing style while retaining source melody
Model target singing style using audio infilling
Disentangle source singing style via cyclic training
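The cyclic training idea above (convert the source into the target style, then use the synthetic result as input and reconstruct the original source) can be sketched as follows. `convert` is a hypothetical stand-in for the conversion model, and the mean-L1 reconstruction loss is a toy choice, not the paper's exact objective.

```python
def cycle_consistency_loss(convert, source_mel, target_style, source_style):
    """One cyclic step: source -> target style -> back to source.

    convert(mel, style) is a placeholder for the conversion model.
    Returns a mean absolute error between the cycle-reconstructed
    mel and the original source mel (toy L1, illustrative only).
    """
    synthetic = convert(source_mel, target_style)     # forward conversion
    reconstructed = convert(synthetic, source_style)  # cycle back to source
    total = sum(
        abs(a - b)
        for fa, fb in zip(reconstructed, source_mel)
        for a, b in zip(fa, fb)
    )
    return total / (len(source_mel) * len(source_mel[0]))

# Toy "model": style is an additive offset, so +1 then -1 cycles back.
convert = lambda mel, style: [[v + style for v in frame] for frame in mel]
loss = cycle_consistency_loss(convert, [[1.0, 2.0], [3.0, 4.0]], 1.0, -1.0)
```

Because the cycle target is the real source mel-spectrogram, the model is pushed to carry style information only through the style input, which is the disentanglement effect the bullets describe.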
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio infilling predicts masked mel-spectrogram segments
Cyclic training disentangles source singing style
Post-processing retains source melody using F0 patterns
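The F0-guided post-processing step can be illustrated with a toy excitation generator. Assumptions to flag: a sinusoidal excitation with frame-constant F0 and silence for unvoiced frames is a simplification; the paper uses a source-filter vocoder, where a converted spectral envelope would then be applied on top of an excitation that follows the original F0 contour, preserving the source melody.

```python
import math

def f0_guided_excitation(f0_contour, frame_len=80, sr=16000):
    """Build a harmonic excitation following the source F0 contour.

    f0_contour: per-frame F0 in Hz, 0.0 for unvoiced frames.
    Toy stand-in for the excitation stage of a source-filter
    vocoder; phase is accumulated so pitch changes stay continuous.
    """
    phase = 0.0
    excitation = []
    for f0 in f0_contour:
        for _ in range(frame_len):
            if f0 > 0:  # voiced frame: sinusoid at the source F0
                phase += 2.0 * math.pi * f0 / sr
                excitation.append(math.sin(phase))
            else:       # unvoiced frame: silence in this toy version
                excitation.append(0.0)
    return excitation

exc = f0_guided_excitation([220.0, 220.0, 0.0], frame_len=4, sr=16000)
```

Driving resynthesis with the original F0 in this way is what alleviates out-of-tune singing, at the cost the abstract notes: the pitch patterns themselves are never converted to the target style.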