🤖 AI Summary
To address the significant degradation of pitch estimation (PE) performance on monophonic audio under noisy conditions, this paper proposes FCPE, a lightweight, context-based PE model aimed at MIDI transcription and singing voice conversion (SVC). The model adopts a Lynx-Net architecture with depthwise separable convolutions, which reduces computational cost while retaining noise robustness; mel spectrogram features serve as input. On the MIR-1K dataset, the method achieves a Raw Pitch Accuracy (RPA) of 96.79%, on par with state-of-the-art methods, and a Real-Time Factor (RTF) of 0.0062 (about 161× faster than real time) on a single RTX 4090 GPU, substantially more efficient than existing algorithms. The key contribution is the combination of Lynx-Net with depthwise separable convolutions for robust PE, pairing competitive accuracy with very low computational cost.
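As a rough illustration of why depthwise separable convolutions lower cost, the sketch below compares weight counts for a standard 1-D convolution and its depthwise separable counterpart. The channel width (256) and kernel size (7) are hypothetical values chosen for the arithmetic, not figures taken from the paper, and biases are ignored.

```python
def standard_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a standard 1-D convolution (bias ignored)."""
    return c_in * c_out * kernel


def separable_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a depthwise separable 1-D convolution (bias ignored).

    Depthwise step: one kernel per input channel (c_in * kernel).
    Pointwise step: a 1x1 convolution mixing channels (c_in * c_out).
    """
    return c_in * kernel + c_in * c_out


# Hypothetical layer shape, for illustration only.
C_IN, C_OUT, KERNEL = 256, 256, 7

standard = standard_conv1d_params(C_IN, C_OUT, KERNEL)
separable = separable_conv1d_params(C_IN, C_OUT, KERNEL)

print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
# The ratio equals 1/c_out + 1/kernel, so larger kernels save more.
print(f"ratio:     {separable / standard:.3f}")
```

For this assumed shape the separable layer needs roughly 15% of the weights of the standard one; the actual savings in FCPE depend on its real layer sizes.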
📝 Abstract
Pitch estimation (PE) in monophonic audio is crucial for MIDI transcription and singing voice conversion (SVC), but existing methods suffer significant performance degradation under noise. In this paper, we propose FCPE, a fast context-based pitch estimation model that employs a Lynx-Net architecture with depth-wise separable convolutions to effectively capture mel spectrogram features while maintaining low computational cost and robust noise tolerance. Experiments show that our method achieves 96.79% Raw Pitch Accuracy (RPA) on the MIR-1K dataset, on par with the state-of-the-art methods. The Real-Time Factor (RTF) is 0.0062 on a single RTX 4090 GPU, which significantly outperforms existing algorithms in efficiency. Code is available at https://github.com/CNChTu/FCPE.
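The two reported metrics can be sketched as follows. This is a simplified illustration, not the paper's evaluation code: it assumes per-frame pitch in Hz with 0 marking unvoiced frames, uses the conventional 50-cent tolerance for RPA, and counts an unvoiced estimate on a voiced frame as a miss. The function names and sample values are hypothetical.

```python
import math


def raw_pitch_accuracy(ref_hz, est_hz, tol_cents=50.0):
    """Fraction of voiced reference frames whose estimate is within tol_cents.

    ref_hz, est_hz: equal-length sequences of per-frame f0 in Hz (0 = unvoiced).
    """
    voiced = [(r, e) for r, e in zip(ref_hz, est_hz) if r > 0]
    if not voiced:
        return 0.0
    correct = sum(
        1
        for r, e in voiced
        if e > 0 and abs(1200.0 * math.log2(e / r)) <= tol_cents
    )
    return correct / len(voiced)


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1 is faster than real time."""
    return processing_seconds / audio_seconds


# Toy frames: one close estimate, one a semitone off, one unvoiced
# reference frame (ignored), and one missed voiced frame.
ref = [220.0, 440.0, 0.0, 330.0]
est = [222.0, 466.16, 100.0, 0.0]
rpa = raw_pitch_accuracy(ref, est)

# An RTF of 0.0062 means 100 s of audio is processed in 0.62 s.
rtf = real_time_factor(0.62, 100.0)
speedup = 1.0 / rtf

print(f"RPA = {rpa:.3f}, RTF = {rtf:.4f}, speedup = {speedup:.0f}x")
```

The speedup is simply the reciprocal of the RTF, so an RTF of 0.0062 corresponds to about 161 times faster than real time.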