🤖 AI Summary
This work addresses a limitation of existing stem retrieval models, which match missing stems to an audio submix but discard temporal information. The authors propose PHALAR, a contrastive learning framework that builds in pitch-equivariant and phase-equivariant inductive biases through a Learned Spectral Pooling layer and a complex-valued output head, preserving phase coherence and temporal structure while keeping matching efficient. PHALAR's scores correlate significantly more strongly with human coherence judgments than those of semantic baselines, and zero-shot beat tracking and linear chord probing indicate that its representations capture musical structure beyond the retrieval task. On MoisesDB, Slakh, and ChocoChorales, PHALAR sets a new state of the art in retrieval, with a relative accuracy increase of up to approximately 70% over the previous state of the art, while using fewer than 50% of the parameters and training 7× faster.
📝 Abstract
Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework that achieves a relative accuracy increase of up to $\approx 70\%$ over the state of the art while using $<50\%$ of the parameters and training 7$\times$ faster. By utilizing a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR establishes a new retrieval state of the art across MoisesDB, Slakh, and ChocoChorales, and its scores correlate significantly more strongly with human coherence judgments than those of semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structures beyond the retrieval task.
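The abstract does not say how the complex-valued head scores a candidate stem against a submix. As a rough illustration only, the sketch below assumes unit-normalised complex embeddings compared with a Hermitian inner product, which is a generic choice and not necessarily PHALAR's. The function names `complex_scores` and `retrieve`, the embedding size, and the random data are all hypothetical. It shows two general properties of such a score: a phase rotation applied to both sides leaves it unchanged, and a rotation applied to the query alone rotates every score by the same angle without changing its magnitude.

```python
import numpy as np

def complex_scores(query, candidates):
    """One complex score per candidate (hypothetical scoring rule, not the paper's)."""
    # Unit-normalise so scores are comparable across candidates.
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    # Hermitian inner product <q, c_i> = sum_k conj(q_k) * c_ik.
    return c @ np.conj(q)

def retrieve(query, candidates):
    """Index of the candidate whose score has the largest real part."""
    return int(np.argmax(complex_scores(query, candidates).real))

# Synthetic stand-ins for learned embeddings (assumption: 5 stems, 8 complex dims).
rng = np.random.default_rng(0)
dim, n = 8, 5
candidates = rng.normal(size=(n, dim)) + 1j * rng.normal(size=(n, dim))
noise = rng.normal(size=dim) + 1j * rng.normal(size=dim)
query = candidates[3] + 0.1 * noise  # query is a noisy copy of candidate 3

best = retrieve(query, candidates)

# Apply a phase rotation of 0.7 rad to both sides, then to the query only.
rot = np.exp(1j * 0.7)
scores = complex_scores(query, candidates)
scores_both = complex_scores(rot * query, rot * candidates)
scores_query_only = complex_scores(rot * query, candidates)

print("retrieved index:", best)
print("rotation on both sides leaves scores unchanged:",
      np.allclose(scores_both, scores))
print("rotation on query only preserves magnitudes:",
      np.allclose(np.abs(scores_query_only), np.abs(scores)))
```

Under these assumptions the relative phase between query and candidate survives in the angle of the complex score, which a real-valued cosine similarity would have no way to represent.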