Scaling Self-Supervised Speech Models Uncovers Deep Linguistic Relationships: Evidence from the Pacific Cluster

📅 2026-03-07
🤖 AI Summary
This study addresses the limited capacity of existing self-supervised speech models to capture deep phylogenetic relationships among languages: their representations often reflect only geographic proximity or superficial typological similarities. By scaling the S3M model from 126 to 4,017 languages, the work demonstrates for the first time the emergence of robust phylogenetic signals in ultra-large-scale self-supervised speech representations. Through language identification tasks, high-dimensional embedding analysis, and acoustic feature examination, the research shows that the 4K-language model substantially enhances phylogenetic recovery, accurately models complex linguistic contact structures, and forms a stable macro-cluster encompassing Papuan, Oceanic, and Australian languages in the Pacific region. This clustering is driven by underlying acoustic properties, including global energy dynamics.
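The phylogenetic-recovery analysis described above typically works by treating each language's model-derived embedding as a point in a high-dimensional space, computing pairwise distances, and building a tree with distance-based hierarchical clustering. The sketch below illustrates this generic pipeline with toy embeddings; the vector construction, language names, and the use of UPGMA average linkage are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch of distance-based phylogenetic recovery from
# per-language embeddings. Toy vectors stand in for S3M-derived
# representations; two synthetic "families" occupy different subspaces.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Two family prototypes in a 16-dim space, plus small per-language noise.
base_a = np.zeros(16); base_a[:8] = 1.0
base_b = np.zeros(16); base_b[8:] = 1.0
family_a = base_a + rng.normal(0.0, 0.05, size=(3, 16))
family_b = base_b + rng.normal(0.0, 0.05, size=(3, 16))
embeddings = np.vstack([family_a, family_b])
languages = ["lang_a1", "lang_a2", "lang_a3",
             "lang_b1", "lang_b2", "lang_b3"]

# Pairwise cosine distances between language embeddings.
dists = pdist(embeddings, metric="cosine")

# UPGMA (average linkage) builds a rooted tree over the distance
# matrix, a standard choice in distance-based phylogenetics.
tree = linkage(dists, method="average")

# Cutting the tree into two clusters should separate the toy families.
labels = fcluster(tree, t=2, criterion="maxclust")
print(dict(zip(languages, labels)))
```

In the actual study, the interesting question is how well such recovered trees match established genealogical classifications as the number of training languages grows, rather than the clustering step itself.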

📝 Abstract
Similarities between language representations derived from Self-Supervised Speech Models (S3Ms) have been observed to primarily reflect geographic proximity or surface typological similarities driven by recent expansion or contact, potentially missing deeper genealogical signals. We investigate how scaling linguistic coverage of an S3M-based language identification system from 126 to 4,017 languages influences this topology. Our results reveal a non-linear effect: while phylogenetic recovery remains stagnant up to the 1K scale, the 4K model displays a dramatic qualitative shift, resolving both clear lineages and complex, long-term linguistic contact. Notably, our analysis reveals the emergence of a robust macro-cluster in the Pacific (comprising Papuan, Oceanic, and Australian languages) and investigates its latent drivers. We find that the 4K model utilizes a more concentrated encoding that captures shared, robust acoustic signatures such as global energy dynamics. These findings suggest that massive S3Ms can internalize multiple layers of language history, providing a promising perspective for computational phylogenetics and the study of language contact.
Problem

Research questions and friction points this paper is trying to address.

Self-Supervised Speech Models
linguistic phylogeny
language contact
genealogical signals
language representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Supervised Speech Models
language phylogeny
large-scale modeling
linguistic contact
acoustic signatures