🤖 AI Summary
Prior work lacks systematic evaluation of whether joint language–audio embedding models effectively encode human-perceived timbral semantics (e.g., brightness, roughness, warmth).
Method: We introduce the first psychoacoustically aligned benchmark to quantitatively assess the cross-modal timbral representation capabilities of MS-CLAP, LAION-CLAP, and MuQ-MuLan on instrumental and sound-effect audio. Using human-annotated timbral attributes as ground-truth references, we combine multimodal embedding analysis with perceptual dimension mapping to measure how faithfully timbral semantics are captured in text–audio alignment.
Results: LAION-CLAP achieves the strongest and most consistent alignment across the evaluated perceptual dimensions, outperforming MS-CLAP and MuQ-MuLan and indicating that its shared embedding space more faithfully reflects the structure of human timbre perception. This work establishes a novel, interpretable benchmark for audio semantic modeling and provides empirical support for perceptually grounded representation learning.
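
The evaluation described above can be illustrated with a minimal sketch. This is not the authors' exact protocol: it assumes L2-normalized text and audio embeddings from a joint language–audio model (e.g., LAION-CLAP), a contrastive pair of timbral prompts such as "a bright sound" / "a dull sound", and per-clip human ratings of the same attribute. The function name `timbre_alignment_score` and the prompt-pair scoring scheme are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr


def timbre_alignment_score(audio_embs, pos_text_emb, neg_text_emb, human_ratings):
    """Rank-correlate model-derived timbre scores with human ratings.

    audio_embs:    (n_clips, d) L2-normalized audio embeddings
    pos_text_emb:  (d,) embedding of a positive prompt, e.g. "a bright sound"
    neg_text_emb:  (d,) embedding of a negative prompt, e.g. "a dull sound"
    human_ratings: (n_clips,) human ratings of the same timbral attribute
    """
    # Score each clip by its similarity to the positive prompt minus its
    # similarity to the negative prompt (a contrastive prompt pair).
    model_scores = audio_embs @ pos_text_emb - audio_embs @ neg_text_emb
    # Spearman rank correlation between model scores and human judgments.
    rho, p_value = spearmanr(model_scores, human_ratings)
    return rho, p_value


if __name__ == "__main__":
    # Synthetic stand-in data, just to show the expected shapes.
    rng = np.random.default_rng(0)
    n_clips, d = 20, 512
    audio = rng.normal(size=(n_clips, d))
    audio /= np.linalg.norm(audio, axis=1, keepdims=True)
    bright = rng.normal(size=d); bright /= np.linalg.norm(bright)
    dull = rng.normal(size=d); dull /= np.linalg.norm(dull)
    ratings = rng.uniform(0.0, 1.0, size=n_clips)
    print(timbre_alignment_score(audio, bright, dull, ratings))
```

In practice, the text and audio embeddings would come from each model's own inference API (for instance, the laion_clap package provides text- and audio-embedding methods); the contrastive prompt pairing shown here is one common way to probe an attribute direction in a shared embedding space, not necessarily the one used in the paper.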
📝 Abstract
Understanding and modeling the relationship between language and sound is critical for applications such as music information retrieval, text-guided music generation, and audio captioning. Central to these tasks is the use of joint language-audio embedding models, which map textual descriptions and auditory content into a shared embedding space. While multimodal embedding models such as MS-CLAP, LAION-CLAP, and MuQ-MuLan have shown strong performance in aligning language and audio, their correspondence to human perception of timbre, a multifaceted attribute encompassing qualities such as brightness, roughness, and warmth, remains underexplored. In this paper, we evaluate these three joint language-audio embedding models on their ability to capture perceptual dimensions of timbre. Our findings show that LAION-CLAP consistently provides the most reliable alignment with human-perceived timbre semantics across both instrumental sounds and audio effects.