🤖 AI Summary
Neural pitch transformation suffers from two problems: inaccurate pitch-timbre disentanglement caused by the source-filter model, and the lack of paired in-tune/out-of-tune data. To address them, this paper proposes an unpaired pitch conversion framework based on adversarial representation learning. The method introduces a pitch-invariant latent space and integrates it with cycle-consistent GANs so that pitch mappings can be learned from unpaired data. It also unifies self-supervised representation learning with a neural vocoder to form an end-to-end generative architecture. Experiments on both global key transposition and template-driven pitch conversion show that the method significantly improves synthesized audio quality (MOS ↑0.8) while preserving the original singer’s timbral identity. It achieves state-of-the-art performance, outperforming prior approaches in pitch accuracy, naturalness, and timbre fidelity.
📝 Abstract
Pitch manipulation, in which a producer adjusts the pitch of an audio segment to a specific key and intonation, is essential in music production. Neural-network-based pitch-manipulation systems have become popular in recent years because their synthesis quality is superior to that of classical DSP methods. However, their performance remains limited by inaccurate feature disentanglement with source-filter models and by the lack of paired in-tune and out-of-tune training data. This work proposes Neurodyne to address both issues. Specifically, Neurodyne uses adversarial representation learning to learn a pitch-independent latent representation, which avoids inaccurate disentanglement, and cycle-consistency training to create paired training data implicitly. Experimental results on global-key and template-based pitch manipulation demonstrate the effectiveness of the proposed system, showing improved synthesis quality while preserving the original singer's identity.
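As a rough illustration of the cycle-consistency idea, the toy sketch below transposes an F0 contour by a number of semitones and then transposes it back, measuring how far the round trip lands from the original. It is not Neurodyne's architecture: the function names `shift_f0` and `cycle_consistency_l1`, the convention that unvoiced frames are marked with 0.0, and the use of an exactly invertible equal-temperament shift in place of a learned network are all assumptions made for illustration. In a learned system the forward and backward mappings would be neural networks, and a round-trip error of this kind would serve as a training loss rather than being zero by construction.

```python
def shift_f0(f0_hz, semitones):
    """Transpose an F0 contour (Hz per frame) by `semitones` in equal temperament.

    Assumption: unvoiced frames are encoded as 0.0 and are left unchanged.
    """
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


def cycle_consistency_l1(original, reconstructed):
    """Mean absolute error between a contour and its round-trip reconstruction."""
    return sum(abs(a - b) for a, b in zip(original, reconstructed)) / len(original)


# Hypothetical four-frame contour; the second frame is unvoiced.
f0 = [220.0, 0.0, 246.94, 261.63]

up = shift_f0(f0, 2)       # forward mapping: global transposition up two semitones
back = shift_f0(up, -2)    # backward mapping: transpose down again
loss = cycle_consistency_l1(f0, back)  # round-trip error, near zero for this exact toy mapping
```

Because the toy mapping is exactly invertible, the round-trip error is zero up to floating-point rounding. The point of the sketch is only that the source contour acts as its own target after a forward and backward pass, which is how cycle consistency avoids needing paired in-tune and out-of-tune recordings.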