🤖 AI Summary
Human infants acquire phonemic units from just hundreds of hours of speech, whereas current self-supervised speech models require orders of magnitude more data—revealing a critical data-efficiency gap. This paper introduces SpidR-Adapt, a general-purpose speech representation learning framework for rapid low-resource language adaptation. The authors propose Multi-task Adaptive Pre-Training (MAdaPT) and a First-Order Bi-level Optimization (FOBLO) algorithm, enhanced by interleaved supervised initialization to improve meta-training stability. The approach is architecture-agnostic and biologically inspired, requiring less than one hour of unlabeled target-language speech to learn highly discriminative representations. Evaluated on ABX, sWUGGY, sBLIMP, and tSC benchmarks, the method consistently outperforms prior models: one-hour adaptation matches the performance of standard training on 100× more data. Code and pretrained models are publicly released.
📝 Abstract
Human infants, with only a few hundred hours of speech exposure, acquire the basic units of new languages, highlighting a striking efficiency gap compared to data-hungry self-supervised speech models. To address this gap, this paper introduces SpidR-Adapt, a model for rapid adaptation to new languages using minimal unlabeled data. We cast such low-resource speech representation learning as a meta-learning problem and construct a multi-task adaptive pre-training (MAdaPT) protocol, which formulates the adaptation process as a bi-level optimization framework. To enable scalable meta-training under this framework, we propose a novel heuristic solution, first-order bi-level optimization (FOBLO), which avoids heavy computation costs. Finally, we stabilize meta-training with a robust initialization obtained through interleaved supervision, which alternates self-supervised and supervised objectives. Empirically, SpidR-Adapt achieves rapid gains in phonemic discriminability (ABX) and spoken language modeling (sWUGGY, sBLIMP, tSC), improving over in-domain language models after training on less than 1h of target-language audio—over $100\times$ more data-efficient than standard training. These findings highlight a practical, architecture-agnostic path toward biologically inspired, data-efficient representations. We open-source the training code and model checkpoints at https://github.com/facebookresearch/spidr-adapt.
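To make the first-order bi-level idea concrete, here is a minimal toy sketch of a FOMAML-style meta-update: an inner loop adapts a copy of the shared parameters on each task, and the outer loop updates the initialization using the gradient evaluated at the adapted parameters, dropping second-order terms. The quadratic task losses, learning rates, and loop structure below are illustrative assumptions for exposition only, not the paper's actual FOBLO objectives or implementation.

```python
import numpy as np

def loss_grad(theta, target):
    """Gradient of a hypothetical quadratic task loss 0.5 * ||theta - target||^2."""
    return theta - target

def fo_bilevel_step(theta, tasks, inner_lr=0.1, outer_lr=0.05, inner_steps=3):
    """One first-order bi-level meta-update (FOMAML-style approximation):
    inner loop adapts per task; outer update averages the plain gradients
    taken at the adapted parameters, ignoring second-order terms."""
    outer_grad = np.zeros_like(theta)
    for target in tasks:
        phi = theta.copy()
        for _ in range(inner_steps):          # inner loop: task adaptation
            phi -= inner_lr * loss_grad(phi, target)
        outer_grad += loss_grad(phi, target)  # first-order: gradient at phi
    return theta - outer_lr * outer_grad / len(tasks)

# Two toy "tasks" pulling toward different optima; the meta-parameters
# converge toward an initialization that adapts quickly to both.
theta = np.zeros(2)
tasks = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
for _ in range(200):
    theta = fo_bilevel_step(theta, tasks)
```

The point of the first-order approximation is that no gradients flow through the inner-loop updates, so the outer step costs only one extra backward pass per task instead of differentiating the whole adaptation trajectory.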