FCPE: A Fast Context-based Pitch Estimation Model

📅 2025-09-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the significant degradation of pitch estimation (PE) performance on monophonic audio under noisy conditions, this paper proposes FCPE, a lightweight, context-based PE model aimed at MIDI transcription and singing voice conversion (SVC). The model adopts a Lynx-Net architecture with depth-wise separable convolutions, which captures mel spectrogram features while keeping computational cost low and noise tolerance robust. On the MIR-1K dataset, FCPE achieves a Raw Pitch Accuracy (RPA) of 96.79%, on par with state-of-the-art methods, and a Real-Time Factor (RTF) of 0.0062 (≈161× faster than real time) on a single RTX 4090 GPU, significantly outperforming existing algorithms in efficiency. The key contribution is pairing Lynx-Net with depth-wise separable convolutions for robust PE, combining competitive accuracy with very fast inference.
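The relationship between the reported RTF and the "≈161×" figure can be checked with a few lines of arithmetic. The sketch below assumes the usual definition of RTF as processing time divided by audio duration; the helper name `real_time_factor` and the one-hour example are illustrative and do not come from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    return processing_seconds / audio_seconds

# Value reported in the abstract (single RTX 4090 GPU).
rtf = 0.0062

# An RTF below 1 means faster than real time; the speed-up is its reciprocal.
speedup = 1.0 / rtf

# Illustrative consequence: time needed for one hour of audio at this RTF.
seconds_for_one_hour = rtf * 3600.0

print(f"speed-up: {speedup:.1f}x, one hour of audio: {seconds_for_one_hour:.2f} s")
```

Under this definition, an RTF of 0.0062 corresponds to roughly 22 seconds of compute per hour of audio on the stated hardware.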

📝 Abstract

Pitch estimation (PE) in monophonic audio is crucial for MIDI transcription and singing voice conversion (SVC), but existing methods suffer significant performance degradation under noise. In this paper, we propose FCPE, a fast context-based pitch estimation model that employs a Lynx-Net architecture with depth-wise separable convolutions to effectively capture mel spectrogram features while maintaining low computational cost and robust noise tolerance. Experiments show that our method achieves 96.79% Raw Pitch Accuracy (RPA) on the MIR-1K dataset, on par with state-of-the-art methods. The Real-Time Factor (RTF) is 0.0062 on a single RTX 4090 GPU, which significantly outperforms existing algorithms in efficiency. Code is available at https://github.com/CNChTu/FCPE.
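For readers unfamiliar with the RPA metric, the following is a minimal sketch of how Raw Pitch Accuracy is commonly computed: the fraction of frames voiced in the reference whose estimated pitch lies within a tolerance of the reference pitch. The 50-cent tolerance is the conventional choice and is an assumption here, as are the function names and the toy frame values; the paper's exact evaluation protocol may differ.

```python
import math

def hz_to_cents(f_hz: float, ref_hz: float = 10.0) -> float:
    """Convert a frequency in Hz to cents relative to an arbitrary reference."""
    return 1200.0 * math.log2(f_hz / ref_hz)

def raw_pitch_accuracy(ref_hz, est_hz, tol_cents: float = 50.0) -> float:
    """Fraction of reference-voiced frames estimated within tol_cents.

    A frame with frequency 0 is treated as unvoiced. Frames that are
    unvoiced in the reference are ignored; an unvoiced estimate on a
    voiced reference frame counts as an error.
    """
    voiced = [(r, e) for r, e in zip(ref_hz, est_hz) if r > 0]
    if not voiced:
        return 0.0
    correct = sum(
        1
        for r, e in voiced
        if e > 0 and abs(hz_to_cents(e) - hz_to_cents(r)) < tol_cents
    )
    return correct / len(voiced)

# Toy per-frame pitch tracks in Hz (hypothetical data, 0.0 = unvoiced).
ref = [220.0, 220.0, 0.0, 440.0, 440.0]
est = [221.0, 233.1, 0.0, 439.0, 0.0]

rpa = raw_pitch_accuracy(ref, est)
print(f"RPA = {rpa:.2%}")
```

In this toy example two of the four voiced frames are within tolerance: the second estimate is about a semitone sharp, and the last frame is missed entirely.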

Problem

Research questions and friction points this paper is trying to address.

Robust monophonic pitch estimation under noisy conditions

Fast computational performance for real-time applications

Accurate MIDI transcription and singing voice conversion

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lynx-Net architecture with depth-wise separable convolutions

Extracts mel spectrogram features efficiently

Maintains low computational cost and noise robustness
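The cost saving from depth-wise separable convolutions can be illustrated by counting weights. A standard 1D convolution mixes all input channels in every filter, whereas the separable form applies one filter per channel (depth-wise) followed by a 1×1 convolution that mixes channels (point-wise). The channel count and kernel size below are hypothetical and are not taken from the FCPE configuration; biases are omitted for simplicity.

```python
def standard_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a standard 1D convolution (bias omitted)."""
    return c_in * c_out * kernel

def depthwise_separable_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a depth-wise conv followed by a 1x1 point-wise conv (bias omitted)."""
    depthwise = c_in * kernel   # one length-`kernel` filter per input channel
    pointwise = c_in * c_out    # 1x1 convolution that mixes channels
    return depthwise + pointwise

# Hypothetical layer shape, chosen only for illustration.
c_in, c_out, kernel = 256, 256, 31

standard = standard_conv1d_params(c_in, c_out, kernel)
separable = depthwise_separable_conv1d_params(c_in, c_out, kernel)
ratio = standard / separable

print(f"standard: {standard:,}  separable: {separable:,}  reduction: {ratio:.1f}x")
```

For this illustrative shape the separable layer needs roughly 28 times fewer weights, and the saving grows with kernel size, which is why the technique suits models that need wide temporal context at low cost.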
