🤖 AI Summary
Addressing the absence of a unique ground-truth annotation for rhythm guitar transcription in multi-track music, this paper proposes a three-stage automatic transcription framework: (1) guitar source separation to approximate the guitar stem; (2) precise strum onset detection using the pre-trained audio foundation model MERT; and (3) decoding of the strum sequence into patterns drawn from an expert-curated, human-readable rhythmic vocabulary, yielding structured rhythmic notation that includes bar lines and time signatures. Key contributions include: (1) the first guitar rhythm dataset featuring expert annotations and extensive real-world recordings; (2) a rhythm decoding method that jointly accounts for human readability and metrical structure; and (3) a dedicated suite of evaluation metrics for rhythm sequences. Ablation studies and error analysis demonstrate high accuracy and robustness in polyphonic scenarios.
📝 Abstract
Whereas chord transcription has received considerable attention during the past couple of decades, far less work has been devoted to transcribing and encoding the rhythmic patterns that occur in a song. The topic is especially relevant for instruments such as the rhythm guitar, which is typically played by strumming rhythmic patterns that repeat and vary over time. However, in many cases one cannot objectively define a single "right" rhythmic pattern for a given song section. To create a dataset with well-defined ground-truth labels, we asked expert musicians to transcribe the rhythmic patterns in 410 popular songs and record cover versions where the guitar tracks followed those transcriptions. To transcribe the strums and their corresponding rhythmic patterns, we propose a three-step framework. Firstly, we perform approximate stem separation to extract the guitar part from the polyphonic mixture. Secondly, we detect individual strums within the separated guitar audio, using a pre-trained foundation model (MERT) as a backbone. Finally, we carry out a pattern-decoding process in which the transcribed sequence of guitar strums is represented by patterns drawn from an expert-curated vocabulary. We show that it is possible to transcribe the rhythmic patterns of the guitar track in polyphonic music with quite high accuracy, producing a representation that is human-readable and includes automatically detected bar lines and time signature markers. We perform ablation studies and error analysis and propose a set of evaluation metrics to assess the accuracy and readability of the predicted rhythmic pattern sequence.
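To make the final pattern-decoding step concrete, here is a minimal illustrative sketch: detected strum onset times are quantized onto a per-bar 16th-note grid, and each bar's activation vector is matched to the nearest entry in a pattern vocabulary by Hamming distance. The function names, the toy vocabulary, and the nearest-neighbour matching rule are all assumptions for illustration, not the paper's actual implementation (which jointly optimizes readability and metrical structure).

```python
# Hypothetical sketch of pattern decoding. All names and the toy
# vocabulary below are illustrative assumptions, not the paper's method.

GRID = 16  # 16th-note resolution for one 4/4 bar

# Toy pattern vocabulary: 1 = strum on that 16th-note slot.
VOCAB = {
    "quarters": [1 if i % 4 == 0 else 0 for i in range(GRID)],
    "eighths":  [1 if i % 2 == 0 else 0 for i in range(GRID)],
    "pop_strum": [1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0],
}

def quantize_bar(onsets, bar_start, bar_dur, grid=GRID):
    """Map strum onset times (seconds) inside one bar to a binary grid vector."""
    vec = [0] * grid
    for t in onsets:
        if bar_start <= t < bar_start + bar_dur:
            slot = round((t - bar_start) / bar_dur * grid) % grid
            vec[slot] = 1
    return vec

def decode_bar(vec):
    """Pick the vocabulary pattern with minimum Hamming distance to the bar."""
    return min(VOCAB, key=lambda name: sum(a != b for a, b in zip(vec, VOCAB[name])))

# Usage: strums on each quarter note of a 2-second bar starting at t = 0.
onsets = [0.0, 0.5, 1.0, 1.5]
print(decode_bar(quantize_bar(onsets, bar_start=0.0, bar_dur=2.0)))  # -> quarters
```

A real decoder would additionally infer bar lines and time signatures and penalize pattern switches to keep the output readable; this sketch only shows the per-bar vocabulary-matching idea.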