Self-supervised restoration of singing voice degraded by pitch shifting using shallow diffusion

📅 2026-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work reframes pitch shifting as a self-supervised restoration problem in order to mitigate the formant shifts and mechanical artifacts that commonly appear at large pitch intervals. A shallow Mel-spectrogram diffusion model, conditioned on frame-level fundamental frequency (f0), loudness, and content features, restores natural-sounding singing from distorted pitch-shifted audio. Training pairs are constructed automatically by applying pitch shifts and then reversing them, so no manual annotation is needed. On a curated singing dataset, the approach substantially reduces pitch-shifting artifacts compared to representative classical baselines, as measured by both statistical metrics and pairwise acoustic measures, while preserving melody and timing.

📝 Abstract
Pitch shifting has been an essential feature in singing voice production. However, conventional signal processing approaches exhibit well-known trade-offs, such as formant shifts and robotic coloration, that become more severe at larger transposition intervals. This paper targets high-quality pitch shifting for singing by reframing it as a restoration problem: given an audio track that has been pitch-shifted (and thus contaminated by artifacts), we recover a natural-sounding performance while preserving its melody and timing. Specifically, we use a lightweight, mel-space diffusion model driven by frame-level acoustic features such as f0, volume, and content features. We construct training pairs in a self-supervised manner by applying pitch shifts and reversing them, which simulates realistic artifacts while retaining the ground truth. On a curated singing set, the proposed approach substantially reduces pitch-shift artifacts compared to representative classical baselines, as measured by both statistical metrics and pairwise acoustic measures. The results suggest that restoration-based pitch shifting could be a viable approach towards artifact-resistant transposition in vocal production workflows.
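The self-supervised pair construction described in the abstract (shift the pitch, then shift it back, and keep the untouched recording as the target) might look roughly like the sketch below. This is an illustration under stated assumptions, not the paper's pipeline: the function names `resample_shift` and `make_training_pair` are hypothetical, and the stand-in shifter is crude linear-interpolation resampling, which also changes duration and is far simpler than the classical pitch shifters the paper refers to. Any callable with the same signature could be swapped in.

```python
import numpy as np


def resample_shift(x, semitones):
    """Crude stand-in pitch shifter (an assumption, not the paper's method).

    Resamples by linear interpolation, so pitch and duration change
    together, like changing tape speed.
    """
    ratio = 2.0 ** (semitones / 12.0)
    n_out = max(2, int(round(len(x) / ratio)))
    src_pos = np.linspace(0.0, len(x) - 1, n_out)
    return np.interp(src_pos, np.arange(len(x)), x)


def make_training_pair(clean, semitones, shift_fn=resample_shift):
    """Build a (degraded, target) pair from a single clean recording.

    The clean signal is shifted by `semitones` and then shifted back by
    the same amount. The round trip returns to the original pitch but
    keeps whatever artifacts the shifter introduced.
    """
    shifted = shift_fn(clean, semitones)
    round_trip = shift_fn(shifted, -semitones)

    # Align the round-trip length with the clean signal (rounding may
    # leave it a few samples short or long).
    n = len(clean)
    if len(round_trip) < n:
        round_trip = np.pad(round_trip, (0, n - len(round_trip)))
    degraded = round_trip[:n]
    return degraded, clean


# Toy usage: a synthetic two-partial tone instead of real singing.
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 220.0 * t) + 0.3 * np.sin(2 * np.pi * 3000.0 * t)
degraded, target = make_training_pair(clean, semitones=7)
```

Because the target is simply the original recording, no manual annotation is required; the size of the shift controls how severe the simulated artifacts are.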
Problem

Research questions and friction points this paper is trying to address.

pitch shifting
singing voice
artifacts
formant shifts
robotic coloration
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised restoration
shallow diffusion
pitch shifting
singing voice
artifact reduction
🔎 Similar Papers
Yunyi Liu
The University of Sydney
LLM, VQA, Visual Grounding, Report Generation, Medical Image
Taketo Akama
Sony Computer Science Laboratories, Tokyo, Japan