PHALAR: Phasors for Learned Musical Audio Representations

📅 2026-05-05

🤖 AI Summary
This work targets stem retrieval, the task of matching missing stems to a given audio submix, where existing models are limited because they discard temporal information. The authors propose PHALAR, a contrastive learning framework with pitch-equivariant and phase-equivariant inductive biases, enforced by a Learned Spectral Pooling layer and a complex-valued output head so that temporal structure is preserved rather than discarded. Its scores correlate significantly better with human coherence judgments than those of semantic baselines, and zero-shot beat tracking and linear chord probing indicate that the representation captures musical structure beyond the retrieval task. Evaluated on MoisesDB, Slakh, and ChocoChorales, PHALAR sets new state-of-the-art retrieval results, with a relative accuracy increase of up to approximately 70% over the prior state of the art, while using fewer than 50% of the parameters and training 7× faster.
📝 Abstract
Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework achieving a relative accuracy increase of up to $\approx 70\%$ over the state-of-the-art while requiring $<50\%$ of the parameters and a 7$\times$ training speedup. By utilizing a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR establishes new retrieval state-of-the-art across MoisesDB, Slakh, and ChocoChorales, correlating significantly higher with human coherence judgment than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structures beyond the retrieval task.
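The abstract does not spell out how the complex-valued head scores a pair of clips, so the following is only a minimal NumPy sketch of the general idea behind phasor-style features, not PHALAR's implementation. It relies on the standard DFT shift theorem (a circular time shift multiplies each frequency bin by a unit phasor) and on a hypothetical helper, `phasor_similarity`, that compares two complex vectors with a normalised Hermitian inner product. All names and sizes here are assumptions chosen for illustration.

```python
import numpy as np

def phasor_similarity(a, b):
    """Hermitian inner product of two L2-normalised complex vectors.

    Hypothetical scoring function for illustration only; the paper's
    actual head and loss are not described in the abstract.
    """
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return np.vdot(a, b)  # np.vdot conjugates its first argument

rng = np.random.default_rng(0)
n = 64          # assumed signal length
shift = 5       # assumed circular time shift, in samples

# 1) Shift theorem: delaying a signal rotates the phase of each DFT bin.
x = rng.standard_normal(n)
X = np.fft.fft(x)
X_shifted = np.fft.fft(np.roll(x, shift))
k = np.arange(n)
expected = X * np.exp(-2j * np.pi * k * shift / n)
shift_theorem_holds = np.allclose(X_shifted, expected)

# 2) A global phase rotation of a complex embedding leaves the magnitude of
#    the similarity unchanged, while its angle recovers the rotation.
theta = 0.7     # assumed phase offset, in radians
z = rng.standard_normal(16) + 1j * rng.standard_normal(16)
z_rot = z * np.exp(1j * theta)
s = phasor_similarity(z, z_rot)

print(shift_theorem_holds)   # whether the shift theorem check passed
print(abs(s), np.angle(s))   # magnitude and phase of the similarity
```

In this toy setting the magnitude of the score acts as a match strength that ignores a common phase offset, while the angle carries the offset itself. A real-valued, time-pooled embedding would discard that timing information.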
Problem

Research questions and friction points this paper is trying to address.

stem retrieval
musical audio representations
temporal information
audio submix
music source separation
Innovation

Methods, ideas, or system contributions that make the work stand out.

PHALAR
phase-equivariant
pitch-equivariant
learned spectral pooling
contrastive audio representation