Efficient Adapter Tuning for Joint Singing Voice Beat and Downbeat Tracking With Self-Supervised Learning Features

📅 2025-03-13

🏛️ International Society for Music Information Retrieval Conference

📈 Citations: 0

✨ Influential: 0

career value

208K/year

🤖 AI Summary

Joint beat and downbeat tracking in a cappella vocal music is highly challenging due to the absence of stable rhythmic and harmonic cues typically provided by instrumental accompaniment. Method: This paper proposes a lightweight temporal convolutional adapter fine-tuning framework. It is the first to integrate self-supervised DistilHuBERT speech-semantic representations with conventional spectral features, enabling end-to-end joint beat and downbeat estimation via parameter-efficient adapter modules. Contribution/Results: Key innovations include (1) synergistic modeling of self-supervised representations and adapter-based fine-tuning, and (2) a dynamic multi-source feature fusion mechanism tailored to vocal heterogeneity. Experiments demonstrate absolute F1-score improvements of 31.6% for beat tracking and 42.4% for downbeat tracking in a cappella settings—substantially outperforming baseline systems and markedly enhancing robustness.

Technology Category

Application Category

📝 Abstract

Singing voice beat tracking is a challenging task, due to the lack of musical accompaniment that often contains robust rhythmic and harmonic patterns, something most existing beat tracking systems utilize and can be essential for estimating beats. In this paper, a novel temporal convolutional network-based beat-tracking approach featuring self-supervised learning (SSL) representations and adapter tuning is proposed to track the beat and downbeat of singing voices jointly. The SSL DistilHuBERT representations are utilized to capture the semantic information of singing voices and are further fused with the generic spectral features to facilitate beat estimation. Sources of variabilities that are particularly prominent with the non-homogeneous singing voice data are reduced by the efficient adapter tuning. Extensive experiments show that feature fusion and adapter tuning improve the performance individually, and the combination of both leads to significantly better performances than the un-adapted baseline system, with up to 31.6% and 42.4% absolute F1-score improvements on beat and downbeat tracking, respectively.

Problem

Research questions and friction points this paper is trying to address.

Develops a method for singing voice beat and downbeat tracking.

Utilizes self-supervised learning and adapter tuning for feature enhancement.

Improves F1-scores significantly for beat and downbeat tracking.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised learning for singing voice analysis

Adapter tuning reduces variability in singing data

Feature fusion enhances beat and downbeat tracking

🔎 Similar Papers

No similar papers found.