Multilingual Phonological Feature Recognition with Self-Supervised Speech Models

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work proposes PhonoQ-2.0, the first end-to-end multilingual frame-level phonological feature recognition system. It directly predicts a structured 22-dimensional phonological feature vector (manner and place of articulation, voicing, and vowel quality) for each speech frame, without relying on phoneme outputs. To keep predictions linguistically plausible, the model uses a manner-conditioned gating mechanism that activates only the feature groups valid for the predicted manner of articulation. Combining self-supervised speech representations, frame-level multi-label classification, and conditional gating, PhonoQ-2.0 achieves average macro-F1 scores of 91.3% in-domain and 88.9% out-of-domain, outperforming a CTC-based phoneme baseline by 8.8 and 8.6 F1 points, respectively. On unseen languages, macro-F1 rises from 66.9% to 73.6% (+6.7 points on average), with gains of up to 10.8 points.
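The gating idea described above can be illustrated with a small sketch. This is not the authors' implementation: the split of the 22 dimensions (7 manner classes, 8 place features, 6 vowel-quality features, 1 voicing feature), the manner inventory, the soft-gating rule, and the names `gate_frame`, `MANNERS`, and `CONSONANT_MANNERS` are all assumptions made for illustration, since the summary does not specify them.

```python
import math

# Hypothetical layout of the 22-dimensional vector; the source does not give the split.
MANNERS = ["silence", "vowel", "stop", "fricative", "affricate", "nasal", "approximant"]
CONSONANT_MANNERS = ("stop", "fricative", "affricate", "nasal", "approximant")
N_PLACE, N_VOWEL, N_VOICING = 8, 6, 1
N_FEATURES = len(MANNERS) + N_PLACE + N_VOWEL + N_VOICING  # 7 + 8 + 6 + 1 = 22


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gate_frame(logits):
    """Turn one frame of 22 raw logits into manner-gated probabilities.

    Assumed rule: place features are scaled by the probability of a
    consonantal manner, vowel-quality features by the probability of
    'vowel', and voicing by the probability of any non-silent manner.
    """
    if len(logits) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} logits, got {len(logits)}")
    n_m = len(MANNERS)
    manner = softmax(logits[:n_m])
    place = [sigmoid(x) for x in logits[n_m:n_m + N_PLACE]]
    vowel = [sigmoid(x) for x in logits[n_m + N_PLACE:n_m + N_PLACE + N_VOWEL]]
    voicing = [sigmoid(x) for x in logits[n_m + N_PLACE + N_VOWEL:]]

    p = dict(zip(MANNERS, manner))
    consonant_gate = sum(p[m] for m in CONSONANT_MANNERS)
    vowel_gate = p["vowel"]
    speech_gate = 1.0 - p["silence"]

    return (
        manner
        + [consonant_gate * x for x in place]
        + [vowel_gate * x for x in vowel]
        + [speech_gate * x for x in voicing]
    )


# A frame whose manner logits strongly favour "vowel"; all other logits are 0.
vowel_like = [0.0, 10.0] + [0.0] * 20
gated = gate_frame(vowel_like)
```

For the vowel-like frame, the place features are pushed toward zero while the vowel-quality features keep roughly their ungated probability, which is the kind of coherence constraint the summary describes. A hard gate (masking groups by the argmax manner) would be an equally plausible reading of the text.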

📝 Abstract

Phonological features provide a language-general and linguistically grounded representation of speech. We present PhonoQ-2.0, a multilingual frame-level phonological feature recognizer built on self-supervised speech models. The system directly predicts a structured 22-dimensional feature vector per frame encoding manner, vowel quality, place, and voicing, instead of deriving features from phoneme outputs. To ensure phonologically coherent predictions, we introduce a manner-conditioned gating mechanism that activates valid feature groups. Evaluated across multiple languages and corpora, PhonoQ-2.0 achieves an average macro-F1 of 91.3% in-domain and 88.9% out-of-domain. Compared to a strong CTC phoneme baseline, it delivers consistent gains of +8.8 F1 in-domain and +8.6 out-of-domain on average. In unseen-language evaluation, PhonoQ-2.0 improves macro-F1 from 66.9% to 73.6% (+6.7 on average), with gains of up to +10.8 points.
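As a reading aid for the reported scores, the following is a minimal sketch of macro-F1 for frame-level multi-label predictions: an F1 score is computed per feature dimension over all frames, and the per-feature scores are averaged with equal weight. The function name `macro_f1`, the binary 0/1 input format, and the convention of scoring a feature as 0.0 when it has no positives in either reference or prediction are assumptions; the abstract does not state how the paper averages over features, languages, or corpora.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over feature dimensions.

    y_true, y_pred: lists of frames, each frame a list of 0/1 labels,
    with the same number of frames and features in both.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    n_features = len(y_true[0])
    scores = []
    for j in range(n_features):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # Assumed convention: a feature with no positives at all scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_features


# Four frames, two features (toy data, not from the paper).
reference = [[1, 0], [1, 1], [0, 1], [0, 0]]
predicted = [[1, 0], [0, 1], [0, 1], [1, 0]]
score = macro_f1(reference, predicted)  # feature 0: F1 = 0.5, feature 1: F1 = 1.0
```

Because every feature counts equally, a rare feature that is recognized poorly lowers macro-F1 as much as a frequent one, which makes the metric a demanding choice for a 22-dimensional feature inventory.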

Problem

Research questions and friction points this paper is trying to address.

phonological features

multilingual speech recognition

self-supervised speech models

feature coherence

frame-level recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

phonological features

self-supervised speech models

multilingual speech recognition