🤖 AI Summary
This work proposes the concept of “musical metamers,” drawing an analogy to metamerism in color science: audio fragments that are perceptually similar yet differ in their underlying waveforms. The authors describe a method to generate such metamers from any audio recording using joint time–frequency scattering (JTFS) as implemented in the Kymatio framework, which provides GPU acceleration and automatic differentiation. A key advantage of the approach is that it requires no manual preprocessing such as transcription, beat tracking, or source separation. The report gives a mathematical description of JTFS, includes excerpts from the Kymatio source code, and situates JTFS among closely related representations, namely spectrotemporal receptive fields (STRF), modulation power spectra (MPS), and Gabor filterbanks (GBFB).
📝 Abstract
The concept of metamerism originates from colorimetry, where it describes a sensation of visual similarity between two colored lights despite significant differences in spectral content. Likewise, we propose to call ``musical metamerism'' the sensation of auditory similarity elicited by two music fragments whose underlying waveforms differ. In this technical report, we describe a method to generate musical metamers from any audio recording. Our method is based on joint time--frequency scattering (JTFS) in Kymatio, an open-source Python library which enables GPU computing and automatic differentiation. The advantage of our method is that it does not require any manual preprocessing, such as transcription, beat tracking, or source separation. We provide a mathematical description of JTFS as well as some excerpts from the Kymatio source code. Lastly, we review the prior work on JTFS and draw connections with closely related algorithms, such as spectrotemporal receptive fields (STRF), modulation power spectra (MPS), and Gabor filterbanks (GBFB).
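The abstract's core idea, synthesizing a new waveform whose representation matches that of a target recording, can be sketched in miniature. The toy below substitutes a simple phase-blind representation, the magnitude of the DFT, for the actual differentiable JTFS in Kymatio, and runs plain gradient descent with an analytic gradient instead of automatic differentiation; all names and parameters here are illustrative, not from the report.

```python
import numpy as np

# Toy metamer synthesis: find a waveform y whose representation phi(y)
# matches phi(target), starting from random noise. Here phi is a
# phase-blind stand-in (DFT magnitude) for joint time-frequency
# scattering; the converged y matches the target in representation
# space while its waveform (phase structure) differs.

rng = np.random.default_rng(0)
N = 256
t = np.arange(N) / N
target = np.sin(2 * np.pi * 8 * t) + 0.5 * np.sin(2 * np.pi * 21 * t)

def phi(x):
    """Phase-blind 'representation': magnitude of the DFT."""
    return np.abs(np.fft.fft(x))

def loss_and_grad(y, target_phi):
    """Squared representation distance and its analytic gradient.

    Uses d|Y_k|/dy_n = Re(conj(Y_k)/|Y_k| * exp(-2i*pi*k*n/N)),
    assembled for all n via one inverse FFT.
    """
    Y = np.fft.fft(y)
    mag = np.abs(Y)
    diff = mag - target_phi
    g = diff * np.conj(Y) / np.maximum(mag, 1e-12)
    grad = 2 * np.real(np.fft.ifft(np.conj(g))) * len(y)
    return np.sum(diff ** 2), grad

target_phi = phi(target)
y = 0.1 * rng.standard_normal(N)   # random initialization
lr = 1e-4
for _ in range(2000):
    _, grad = loss_and_grad(y, target_phi)
    y -= lr * grad
```

In the report's actual method, `phi` would be the JTFS transform and the gradient would come from automatic differentiation on GPU; the principle of descending on a representation-space distance is the same.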