SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation

📅 2025-12-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Human infants acquire phonemic units from just hundreds of hours of speech, whereas current self-supervised speech models require orders of magnitude more data, revealing a critical data-efficiency gap. This paper introduces SpidR-Adapt, a general-purpose speech representation learning framework for rapid low-resource language adaptation. It proposes Multi-task Adaptive Pre-Training (MAdaPT) and a First-Order Bi-level Optimization (FOBLO) algorithm, enhanced by an interleaved supervised initialization that stabilizes meta-training. The approach is architecture-agnostic and biologically inspired, requiring less than one hour of unlabeled target-language speech to learn highly discriminative representations. Evaluated on the ABX, sWUGGY, sBLIMP, and tSC benchmarks, the method consistently outperforms prior models: one-hour adaptation matches standard training with 100× more data. Code and pretrained models are publicly released.

📝 Abstract
Human infants, with only a few hundred hours of speech exposure, acquire the basic units of new languages, highlighting a striking efficiency gap compared to data-hungry self-supervised speech models. To close this gap, this paper introduces SpidR-Adapt, a model for rapid adaptation to new languages using minimal unlabeled data. We cast low-resource speech representation learning as a meta-learning problem and construct a multi-task adaptive pre-training (MAdaPT) protocol that formulates adaptation as a bi-level optimization problem. To make meta-training scalable under this framework, we propose a novel heuristic solution, first-order bi-level optimization (FOBLO), which avoids heavy computational costs. Finally, we stabilize meta-training with a robust initialization obtained through interleaved supervision, which alternates self-supervised and supervised objectives. Empirically, SpidR-Adapt achieves rapid gains in phonemic discriminability (ABX) and spoken language modeling (sWUGGY, sBLIMP, tSC), improving over in-domain language models after training on less than 1 h of target-language audio, over 100× more data-efficient than standard training. These findings highlight a practical, architecture-agnostic path toward biologically inspired, data-efficient speech representations. We open-source the training code and model checkpoints at https://github.com/facebookresearch/spidr-adapt.
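The first-order bi-level idea in the abstract can be sketched with a FOMAML/Reptile-style approximation: an inner loop adapts a copy of the shared initialization to each task (language), and the outer loop moves the initialization toward the adapted parameters, discarding second-order terms. The paper's exact FOBLO update is not reproduced here; the function names, toy quadratic tasks, and hyperparameters below are illustrative assumptions.

```python
import numpy as np

def inner_adapt(theta, grad_fn, steps=3, lr=0.1):
    """Inner loop: adapt a copy of the initialization to one task's data."""
    phi = theta.copy()
    for _ in range(steps):
        phi -= lr * grad_fn(phi)
    return phi

def foblo_meta_step(theta, task_grad_fns, meta_lr=0.5):
    """Outer loop (first-order): move the initialization toward the
    per-task adapted parameters, ignoring second-order derivatives."""
    deltas = [inner_adapt(theta, g) - theta for g in task_grad_fns]
    return theta + meta_lr * np.mean(deltas, axis=0)

# Toy tasks standing in for languages: quadratic losses with different optima.
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
grad_fns = [lambda phi, t=t: phi - t for t in targets]  # grad of 0.5*||phi - t||^2

theta = np.zeros(2)
for _ in range(50):
    theta = foblo_meta_step(theta, grad_fns)
# theta converges toward an initialization central to all task optima
```

On these toy tasks the meta-parameters settle near the centroid of the task optima, i.e. an initialization from which a few inner steps reach any single task quickly, which is the behavior the bi-level objective targets.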
Problem

Research questions and friction points this paper is trying to address.

Develops a speech model adapting to new languages with minimal data
Addresses inefficiency of current models requiring extensive training data
Enables rapid learning from under one hour of target language audio
Innovation

Methods, ideas, or system contributions that make the work stand out.

Meta-learning framework for few-shot speech adaptation
First-order bi-level optimization to reduce computation
Interleaved supervision stabilizing training with minimal data
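The interleaved-supervision idea listed above can be illustrated as a warm-up loop that alternates gradient steps on a self-supervised and a supervised objective to produce a stable initialization. The toy 1-D quadratic losses and every name here are hypothetical stand-ins, not the paper's actual objectives.

```python
def interleaved_init(theta, ssl_grad, sup_grad, steps=100, lr=0.05):
    """Alternate self-supervised and supervised gradient steps to build
    a robust initialization (hypothetical sketch of interleaved supervision)."""
    for step in range(steps):
        grad = ssl_grad(theta) if step % 2 == 0 else sup_grad(theta)
        theta -= lr * grad
    return theta

# Toy 1-D objectives: the SSL loss pulls toward 1.0, the supervised loss toward 3.0.
ssl_grad = lambda th: th - 1.0   # grad of 0.5*(th - 1)^2
sup_grad = lambda th: th - 3.0   # grad of 0.5*(th - 3)^2
theta = interleaved_init(0.0, ssl_grad, sup_grad)
# theta settles between the two optima rather than drifting to either one
```

With small steps the alternation behaves like optimizing an averaged objective, so neither loss dominates the initialization, which is one plausible reading of why interleaving stabilizes meta-training.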