🤖 AI Summary
This study addresses the alignment between audio representations and human perception of timbre similarity. Classical psychoacoustic "timbre spaces" embed similarity ratings via multidimensional scaling; here, we instead propose a two-dimensional evaluation framework that quantifies both absolute consistency (correlation between embedding distances and subjective ratings) and rank-order agreement (Spearman rank correlation), and use it to systematically assess signal-processing features, pre-trained audio representations (e.g., CLAP), and embeddings from a sound matching model on canonical psychoacoustic datasets. Results show that style embeddings, a technique adapted from image style transfer and extracted from the CLAP and sound matching models, markedly outperform conventional signal-processing features and generic audio embeddings on both metrics. This provides empirical evidence for the potential of cross-modal pre-trained models in timbre modeling and points toward interpretable, perceptually aligned audio representation learning that bridges deep audio semantics with human auditory cognition.
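As a minimal sketch of the two alignment metrics described above (the function name, data shapes, and the use of Euclidean distance are illustrative assumptions, not the paper's code), the comparison between embedding distances and listener ratings could look like:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def alignment_scores(embeddings, pairs, human_dissimilarity):
    """Compare embedding distances with human timbre (dis)similarity ratings.

    embeddings: (n_sounds, d) array, one representation per audio sample.
    pairs: list of (i, j) index pairs that listeners rated.
    human_dissimilarity: one rating per pair (higher = less similar).
    """
    # Distance in embedding space for each rated pair.
    dists = np.array([np.linalg.norm(embeddings[i] - embeddings[j])
                      for i, j in pairs])
    pearson, _ = pearsonr(dists, human_dissimilarity)    # absolute consistency
    spearman, _ = spearmanr(dists, human_dissimilarity)  # rank-order agreement
    return pearson, spearman
```

A representation scores well when its distances both scale with and rank the rated pairs the way listeners do; the two scores can disagree, which is why the framework reports both.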
📝 Abstract
Psychoacoustic "timbre spaces" map perceptual similarity ratings of instrument sounds onto low-dimensional embeddings via multidimensional scaling, but they scale poorly and do not generalize beyond the rated stimuli. Recent results from audio (music and speech) quality assessment, as well as image similarity, have shown that deep learning can produce embeddings that align well with human perception while being largely free of these constraints. Although the existing human-rated timbre similarity data is too small to train deep neural networks (2,614 pairwise ratings on 334 audio samples), it can serve as test-only data for audio models. In this paper, we introduce metrics to assess the alignment of diverse audio representations with human judgments of timbre similarity by comparing both the absolute values and the rankings of embedding distances to human similarity ratings. Our evaluation covers three signal-processing-based representations, twelve representations extracted from pre-trained models, and three representations extracted from a novel sound matching model. Among them, the style embeddings inspired by image style transfer, extracted from the CLAP model and the sound matching model, markedly outperform the others, demonstrating their potential for modeling timbre similarity.
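The "style embeddings" borrow the Gram-matrix construction from image style transfer. The sketch below assumes a channel-by-time feature map taken from an intermediate layer of an audio encoder (e.g., CLAP's); the layer choice, normalization, and flattening are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def style_embedding(feature_map):
    """Gram-matrix style embedding, as in image style transfer.

    feature_map: (channels, time) activations from an intermediate layer
    of a pre-trained audio model (layer choice assumed here).
    Returns the flattened upper triangle of the channel-wise Gram matrix,
    which discards temporal structure and keeps channel co-activations.
    """
    c, t = feature_map.shape
    gram = feature_map @ feature_map.T / t  # (channels, channels), time-averaged
    iu = np.triu_indices(c)                 # Gram matrix is symmetric,
    return gram[iu]                         # so keep only the upper triangle
```

Averaging over time makes the embedding length-invariant, which fits timbre as a property largely independent of a sound's duration.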