🤖 AI Summary
This paper formally introduces and defines the task of K-pop live vocal separation, which addresses the challenge of extracting live vocals that heavily overlap with pre-recorded accompaniment in K-pop concerts. The authors propose a three-stage method combining deep source separation, inter-channel cross-correlation-based time-delay correction, and adaptive amplitude rescaling: (1) coarse vocal-accompaniment separation; (2) phase alignment of the vocal components across multi-channel recordings via cross-correlation; and (3) energy-consistency-driven amplitude rescaling to suppress residual accompaniment. Evaluated on a newly curated K-pop live audio dataset, the approach significantly outperforms baseline models, achieving an average 3.2 dB improvement in Signal-to-Distortion Ratio (SDR). To the best of the authors' knowledge, this is the first method to enable high-fidelity, automated extraction of live vocals under complex real-world mixing conditions, providing a practical technical foundation for fan-generated content creation, vocal performance analysis, and real-time interactive applications.
📝 Abstract
K-pop's global success is fueled by its dynamic performances and vibrant fan engagement. Inspired by K-pop fan culture, we propose a methodology for automatically extracting live vocals from performances. We use a combination of source separation, cross-correlation, and amplitude scaling to automatically remove pre-recorded vocals and instrumentals from a live performance. Our preliminary work introduces the task of live vocal separation and provides a foundation for future research on this topic.
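The cross-correlation and amplitude-scaling stages of the pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the coarse separation stage has already produced a live recording `live` and a pre-recorded reference track `ref`, and all function names and parameters here are hypothetical. The delay is estimated from the cross-correlation peak, and the gain is a least-squares projection of the live signal onto the aligned reference (one simple way to realize "energy-consistency-driven" rescaling).

```python
import numpy as np
from scipy.signal import correlate

def estimate_delay(live: np.ndarray, ref: np.ndarray) -> int:
    """Estimate the sample offset of `ref` within `live` via cross-correlation."""
    corr = correlate(live, ref, mode="full")
    # Index 0 of `corr` corresponds to lag -(len(ref) - 1).
    return int(np.argmax(corr)) - (len(ref) - 1)

def subtract_reference(live: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Time-align `ref` to `live`, rescale it by a least-squares gain,
    and subtract it to suppress the pre-recorded component."""
    delay = estimate_delay(live, ref)
    aligned = np.zeros_like(live)
    if delay >= 0:
        n = min(len(live) - delay, len(ref))
        aligned[delay:delay + n] = ref[:n]
    else:
        n = min(len(live), len(ref) + delay)
        aligned[:n] = ref[-delay:-delay + n]
    # Least-squares gain: projection of `live` onto the aligned reference.
    denom = np.dot(aligned, aligned)
    gain = np.dot(live, aligned) / denom if denom > 0 else 0.0
    return live - gain * aligned

# Toy demo: a "live" track = delayed, scaled reference + a small residual.
rng = np.random.default_rng(0)
ref = rng.standard_normal(1000)
live = np.zeros(1200)
live[100:1100] = 0.7 * ref
live += 0.01 * rng.standard_normal(1200)  # residual (stand-in for live vocals)
print(estimate_delay(live, ref))  # recovers the 100-sample offset
```

Real concert audio would need this applied per channel and in short frames (the delay can drift over a performance), but the peak-picking and projection steps are the core of stages (2) and (3).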