🤖 AI Summary
Current polyphonic music transcription models generalize poorly, struggling with unseen instruments and requiring a fixed number of sound sources. To address this, we propose a lightweight end-to-end transcription framework featuring a timbre-invariant backbone network and an attention-based associative memory module inspired by human auditory cognition, enabling dynamic timbre encoding and adaptive source separation. Our method generalizes to novel timbres using only 12.5 minutes of training data and eliminates the need for a pre-specified source count. By integrating deep clustering, synthetic data augmentation, and biologically inspired memory mechanisms, it achieves state-of-the-art performance on public benchmarks, surpassing prior work with roughly half the parameters. The separation module shows markedly better timbre discrimination, while transcription accuracy and robustness improve jointly.
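The summary does not include code, but the described mechanism, frame-level timbre embeddings attending over a bank of memory slots whose attention weights act as soft source assignments, can be sketched concretely. The following is a minimal, hypothetical PyTorch illustration of such an attention-based associative memory; the class name, slot count, and projection layer are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AssociativeTimbreMemory(nn.Module):
    """Hypothetical attention-based associative memory over learnable timbre slots.

    Frame embeddings from a backbone network attend over a bank of memory
    slots; the attention weights serve as a soft clustering of each frame
    to a timbre slot, so the number of active sources need not be fixed
    in advance.
    """
    def __init__(self, embed_dim: int, num_slots: int = 16):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, embed_dim))
        self.query_proj = nn.Linear(embed_dim, embed_dim)
        self.scale = embed_dim ** -0.5

    def forward(self, frames: torch.Tensor):
        # frames: (batch, time, embed_dim) timbre embeddings from the backbone
        q = self.query_proj(frames)                                 # (B, T, D)
        attn = torch.einsum("btd,kd->btk", q, self.slots) * self.scale
        assign = F.softmax(attn, dim=-1)                            # soft slot assignment
        readout = torch.einsum("btk,kd->btd", assign, self.slots)   # memory readout
        return readout, assign

# Toy usage: route each frame embedding to its most likely timbre slot.
mem = AssociativeTimbreMemory(embed_dim=64, num_slots=8)
emb = torch.randn(2, 100, 64)        # dummy backbone output
readout, assign = mem(emb)
print(assign.shape)                  # torch.Size([2, 100, 8])
```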
📝 Abstract
Existing multi-timbre transcription models struggle to generalize beyond pre-trained instruments and are constrained by rigid source counts. We address these limitations with a lightweight deep clustering solution featuring: 1) a timbre-agnostic backbone achieving state-of-the-art performance with only half the parameters of comparable models, and 2) a novel associative memory mechanism that mimics human auditory cognition to dynamically encode unseen timbres via attention-based clustering. Our biologically inspired framework enables adaptive polyphonic separation with minimal training data (12.5 minutes), supported by a new synthetic data generation method that offers cost-effective, high-precision multi-timbre audio. Experiments show that the timbre-agnostic transcription model outperforms existing models on public benchmarks, while the separation module demonstrates promising timbre discrimination. This work provides an efficient framework for timbre-related music transcription and opens new directions for timbre-aware separation through cognition-inspired architectures.
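The abstract names deep clustering as a core ingredient without detailing its objective. The standard deep-clustering loss (Hershey et al., 2016) trains embeddings so that time-frequency bins of the same source cluster together; whether this paper uses the same formulation is an assumption, and the sketch below shows only the classic objective in its memory-efficient expanded form.

```python
import torch
import torch.nn.functional as F

def deep_clustering_loss(V: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Classic deep-clustering objective (Hershey et al., 2016).

    V: (N, D) embeddings, one per time-frequency bin (unit-normalized here).
    Y: (N, C) one-hot source-membership indicators.
    Minimizes ||V V^T - Y Y^T||_F^2, expanded as
    ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2
    to avoid materializing the N x N affinity matrices.
    """
    V = F.normalize(V, dim=-1)
    vtv = V.t() @ V          # (D, D)
    vty = V.t() @ Y          # (D, C)
    yty = Y.t() @ Y          # (C, C)
    return vtv.pow(2).sum() - 2 * vty.pow(2).sum() + yty.pow(2).sum()

# Toy check: 200 bins, 20-dim embeddings, 3 sources.
V = torch.randn(200, 20, requires_grad=True)
Y = F.one_hot(torch.randint(0, 3, (200,)), num_classes=3).float()
loss = deep_clustering_loss(V, Y)
loss.backward()
print(loss.item())
```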