🤖 AI Summary
Music structure analysis (MSA) suffers from inefficient modeling of long audio sequences and from temporal misalignment across windows, primarily because existing pretrained music models rely on high-resolution features computed over short audio windows. To address this, we propose a time-adaptive fine-tuning framework that enables single-pass, whole-track forward inference via audio window expansion and low-resolution feature adaptation. Our key contributions include: (1) temporal feature resampling to preserve structural semantics across scales; (2) cross-window feature alignment to ensure temporal consistency; and (3) a low-resolution feature adaptation mechanism that maintains representational fidelity without increasing memory footprint or inference latency. Evaluated on the Harmonix Set and RWC-Pop benchmarks, our method achieves significant improvements in boundary detection F1-score and structural function classification accuracy, demonstrating superior precision, computational efficiency, and robustness for long-sequence MSA.
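To make the cross-window alignment idea above concrete, here is a minimal, hypothetical sketch (not the paper's implementation): frame-level features from overlapping analysis windows are stitched into one track-level sequence by averaging the overlap regions, so the timeline stays consistent across window boundaries. The function name `stitch_windows` and all parameters are illustrative assumptions.

```python
import numpy as np

def stitch_windows(window_feats, hop_frames):
    """Stitch overlapping window outputs into one aligned sequence.

    Hypothetical sketch of cross-window feature alignment: features in
    overlap regions are averaged so adjacent windows agree in time.
    """
    win_len, dim = window_feats[0].shape
    total = hop_frames * (len(window_feats) - 1) + win_len
    acc = np.zeros((total, dim), dtype=np.float64)
    count = np.zeros((total, 1), dtype=np.float64)
    for i, w in enumerate(window_feats):
        start = i * hop_frames
        acc[start:start + win_len] += w      # accumulate window features
        count[start:start + win_len] += 1    # track overlap multiplicity
    return acc / count                       # average where windows overlap

# Three 100-frame windows with a 50-frame hop -> 200 aligned frames.
wins = [np.ones((100, 8)) * k for k in range(3)]
out = stitch_windows(wins, hop_frames=50)
print(out.shape)  # (200, 8)
```

Averaging is the simplest choice here; a learned alignment (as the paper's fine-tuning framework implies) would replace the plain mean with trainable weighting.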
📝 Abstract
Audio-based music structure analysis (MSA) is an essential task in Music Information Retrieval that remains challenging due to the complexity and variability of musical form. Recent advances highlight the potential of fine-tuning pre-trained music foundation models for MSA tasks. However, these models are typically trained with high temporal feature resolution and short audio windows, which limits their efficiency and introduces bias when applied to long-form audio. This paper presents a temporal adaptation approach for fine-tuning music foundation models tailored to MSA. Our method enables efficient analysis of full-length songs in a single forward pass by incorporating two key strategies: (1) audio window extension and (2) low-resolution adaptation. Experiments on the Harmonix Set and RWC-Pop datasets show that our method significantly improves both boundary detection and structural function prediction, while maintaining comparable memory usage and inference speed.
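The low-resolution adaptation strategy can be sketched as temporal resampling of frame-level embeddings: features produced at a high frame rate are mean-pooled down to a coarser rate better matched to section-level structure, shrinking the sequence a long-range model must attend over. This is a minimal illustration under assumed values (a 75 Hz source frame rate, 768-dim embeddings); `resample_features` and its parameters are hypothetical, not the paper's API.

```python
import numpy as np

def resample_features(features, source_rate, target_rate):
    """Downsample frame-level embeddings by average pooling in time.

    Hypothetical sketch of low-resolution feature adaptation: groups of
    `source_rate / target_rate` consecutive frames are mean-pooled.
    """
    factor = int(round(source_rate / target_rate))
    n_frames, dim = features.shape
    # Trim so the frame count divides evenly, then mean-pool each group.
    usable = (n_frames // factor) * factor
    return features[:usable].reshape(-1, factor, dim).mean(axis=1)

# e.g. 3 minutes of audio at an assumed 75 Hz frame rate, 768-dim features
feats = np.random.randn(180 * 75, 768).astype(np.float32)
low_res = resample_features(feats, source_rate=75, target_rate=5)
print(low_res.shape)  # (900, 768): 15x fewer frames for whole-track inference
```

Pooling by a factor of 15 here turns 13,500 frames into 900, which is what makes a single forward pass over a full-length song tractable in memory.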