🤖 AI Summary
Music structure analysis (MSA) suffers from inefficient modeling of long audio sequences and from temporal misalignment across windows, primarily because existing pretrained music models rely on high-resolution features computed over short audio windows. To address this, we propose a time-adaptive fine-tuning framework that enables single-pass, whole-track forward inference via audio window expansion and low-resolution feature adaptation. Our key contributions include: (1) temporal feature resampling to preserve structural semantics across scales; (2) cross-window feature alignment to ensure temporal consistency; and (3) a low-resolution feature adaptation mechanism that maintains representational fidelity without increasing memory footprint or inference latency. Evaluated on the Harmonix Set and RWC-Pop benchmarks, our method achieves significant improvements in boundary detection F1-score and structural function classification accuracy, demonstrating superior precision, computational efficiency, and robustness for long-sequence MSA.
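To make the cross-window alignment idea above concrete, here is a minimal, hypothetical sketch (not the paper's implementation): frame-level features from overlapping analysis windows are stitched into one track-level sequence by averaging the overlap regions, so the timeline stays consistent across window boundaries. The function name `stitch_windows` and all parameters are illustrative assumptions.

```python
import numpy as np

def stitch_windows(window_feats, hop_frames):
    """Stitch overlapping window outputs into one aligned sequence.

    Hypothetical sketch of cross-window feature alignment: features in
    overlap regions are averaged so adjacent windows agree in time.
    """
    win_len, dim = window_feats[0].shape
    total = hop_frames * (len(window_feats) - 1) + win_len
    acc = np.zeros((total, dim), dtype=np.float64)
    count = np.zeros((total, 1), dtype=np.float64)
    for i, w in enumerate(window_feats):
        start = i * hop_frames
        acc[start:start + win_len] += w      # accumulate window features
        count[start:start + win_len] += 1    # track overlap multiplicity
    return acc / count                       # average where windows overlap

# Three 100-frame windows with a 50-frame hop -> 200 aligned frames.
wins = [np.ones((100, 8)) * k for k in range(3)]
out = stitch_windows(wins, hop_frames=50)
print(out.shape)  # (200, 8)
```

Averaging is the simplest choice here; a learned alignment (as the paper's fine-tuning framework implies) would replace the plain mean with trainable weighting.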
📝 Abstract
Audio-based music structure analysis (MSA) is an essential task in Music Information Retrieval that remains challenging due to the complexity and variability of musical form. Recent advances highlight the potential of fine-tuning pre-trained music foundation models for MSA tasks. However, these models are typically trained with high temporal feature resolution and short audio windows, which limits their efficiency and introduces bias when applied to long-form audio. This paper presents a temporal adaptation approach for fine-tuning music foundation models tailored to MSA. Our method enables efficient analysis of full-length songs in a single forward pass by incorporating two key strategies: (1) audio window extension and (2) low-resolution adaptation. Experiments on the Harmonix Set and RWC-Pop datasets show that our method significantly improves both boundary detection and structural function prediction, while maintaining comparable memory usage and inference speed.
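The low-resolution adaptation strategy can be sketched as temporal resampling of frame-level embeddings: features produced at a high frame rate are mean-pooled down to a coarser rate better matched to section-level structure, shrinking the sequence a long-range model must attend over. This is a minimal illustration under assumed values (a 75 Hz source frame rate, 768-dim embeddings); `resample_features` and its parameters are hypothetical, not the paper's API.

```python
import numpy as np

def resample_features(features, source_rate, target_rate):
    """Downsample frame-level embeddings by average pooling in time.

    Hypothetical sketch of low-resolution feature adaptation: groups of
    `source_rate / target_rate` consecutive frames are mean-pooled.
    """
    factor = int(round(source_rate / target_rate))
    n_frames, dim = features.shape
    # Trim so the frame count divides evenly, then mean-pool each group.
    usable = (n_frames // factor) * factor
    return features[:usable].reshape(-1, factor, dim).mean(axis=1)

# e.g. 3 minutes of audio at an assumed 75 Hz frame rate, 768-dim features
feats = np.random.randn(180 * 75, 768).astype(np.float32)
low_res = resample_features(feats, source_rate=75, target_rate=5)
print(low_res.shape)  # (900, 768): 15x fewer frames for whole-track inference
```

Pooling by a factor of 15 here turns 13,500 frames into 900, which is what makes a single forward pass over a full-length song tractable in memory.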