Drum-to-Vocal Percussion Sound Conversion and Its Evaluation Methodology

📅 2025-09-20
🤖 AI Summary
This work formally defines the task of converting drum sounds to vocal percussion (VP), framing it as cross-domain timbre transfer. Because VP is aperiodic, has noisy transients, and lacks linguistic structure, conventional speech and singing synthesis methods are ill-suited to it. The authors define three requirements for successful conversion (rhythmic fidelity, timbral consistency, and naturalness as VP) and propose corresponding subjective evaluation criteria. Two baselines are built on RAVE, a neural audio synthesizer, with and without vector quantization (VQ). Subjective listening tests indicate that both models generate plausible VP audio, with the VQ-based RAVE model yielding more consistent conversion. The study provides a task definition and baseline pipeline for VP generation in a cappella music.

📝 Abstract
This paper defines the novel task of drum-to-vocal percussion (VP) sound conversion. VP imitates percussion instruments through human vocalization and is frequently employed in contemporary a cappella music. It exhibits acoustic properties distinct from speech and singing (e.g., aperiodicity, noisy transients, and the absence of linguistic structure), making conventional speech or singing synthesis methods unsuitable. We thus formulate VP synthesis as a timbre transfer problem from drum sounds, leveraging their rhythmic and timbral correspondence. To support this formulation, we define three requirements for successful conversion: rhythmic fidelity, timbral consistency, and naturalness as VP. We also propose corresponding subjective evaluation criteria. We implement two baseline conversion methods using a neural audio synthesizer, the real-time audio variational autoencoder (RAVE), with and without vector quantization (VQ). Subjective experiments show that both methods produce plausible VP outputs, with the VQ-based RAVE model yielding more consistent conversion.

Problem

Research questions and friction points this paper is trying to address.

Converting drum sounds to vocal percussion while preserving rhythm
Addressing unique acoustic properties distinct from speech synthesis
Establishing evaluation criteria for rhythmic and timbral conversion quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Drum-to-vocal percussion conversion as timbre transfer
Baselines implemented with the RAVE neural audio synthesizer, with and without vector quantization
Vector quantization yields more consistent conversion in subjective tests
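
To make the vector quantization step concrete, here is a minimal NumPy sketch of generic nearest-neighbor quantization. It is not the paper's implementation or RAVE's API: the function name, the toy codebook, and the 2-dimensional latent frames are all hypothetical, chosen only to illustrate how continuous latent frames can be snapped to a small set of discrete codes.

```python
import numpy as np


def vector_quantize(latents, codebook):
    """Map each latent frame to its nearest codebook vector (Euclidean distance).

    latents:  (T, D) array of continuous latent frames (hypothetical encoder output).
    codebook: (K, D) array of code vectors (hypothetical; learned in a real model).
    Returns (indices, quantized) with shapes (T,) and (T, D).
    """
    # Squared distance between every frame and every code vector: shape (T, K).
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Pick the closest code for each frame, then look up its vector.
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]


# Toy data, assumed for illustration only.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.2, 0.1]])

indices, quantized = vector_quantize(latents, codebook)
print(indices)    # [0 1 2 0]
print(quantized)  # each row is the code vector chosen for that frame
```

Because every frame is replaced by one of a fixed set of code vectors, nearby inputs map to the same output, which is one plausible intuition for why a quantized model might convert more consistently. The source itself reports only the subjective result, not this explanation.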