Multilingual Phonological Feature Recognition with Self-Supervised Speech Models

📅 2026-05-25
đŸ€– AI Summary
This work proposes PhonoQ-2.0, the first end-to-end multilingual frame-level phonological feature recognition system. It directly predicts a structured 22-dimensional phonological feature vector for each speech frame, covering manner and place of articulation, voicing, and vowel quality, without relying on phoneme outputs. To keep predictions linguistically plausible, the model incorporates a manner-of-articulation conditional gating mechanism that activates only the feature groups valid for the predicted manner. By combining self-supervised speech representations, frame-level multi-label classification, and conditional gating, PhonoQ-2.0 achieves macro-F1 scores of 91.3% in-domain and 88.9% out-of-domain. It outperforms a CTC-based phoneme baseline by 8.7 F1 points on average (+8.8 in-domain, +8.6 out-of-domain), and on unseen languages it raises macro-F1 from 66.9% to 73.6% on average, with gains of up to 10.8 points.
📝 Abstract
Phonological features provide a language-general and linguistically grounded representation of speech. We present PhonoQ-2.0, a multilingual frame-level phonological feature recognizer built on self-supervised speech models. The system directly predicts a structured 22-dimensional feature vector per frame encoding manner, vowel quality, place, and voicing, instead of deriving features from phoneme outputs. To ensure phonologically coherent predictions, we introduce a manner-conditioned gating mechanism that activates valid feature groups. Evaluated across multiple languages and corpora, PhonoQ-2.0 achieves an average macro-F1 of 91.3% in-domain and 88.9% out-of-domain. Compared to a strong CTC phoneme baseline, it delivers consistent gains of +8.8 F1 in-domain and +8.6 out-of-domain on average. In unseen-language evaluation, PhonoQ-2.0 improves macro-F1 from 66.9% to 73.6% (+6.7 on average), with gains of up to +10.8 points.
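The manner-conditioned gating described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the split of the 22 dimensions into groups, the manner inventory, the validity table, and the use of a softmax over manner are all assumptions made for illustration, since the abstract states only the total dimensionality and the four group names.

```python
import numpy as np

N_FEATURES = 22

# Hypothetical layout of the 22-dimensional vector (assumed, not from the paper).
GROUPS = {
    "manner": slice(0, 7),
    "vowel": slice(7, 13),
    "place": slice(13, 21),
    "voicing": slice(21, 22),
}

# Hypothetical manner inventory (assumed, not from the paper).
MANNERS = ["silence", "vowel", "stop", "fricative",
           "nasal", "approximant", "affricate"]

# Hypothetical validity table: feature groups each manner class may activate.
VALID = {
    "silence": [],
    "vowel": ["vowel", "voicing"],
    "stop": ["place", "voicing"],
    "fricative": ["place", "voicing"],
    "nasal": ["place", "voicing"],
    "approximant": ["place", "voicing"],
    "affricate": ["place", "voicing"],
}


def build_gate_matrix():
    """Return a (num_manners, 22) binary mask, one row per manner class."""
    gate = np.zeros((len(MANNERS), N_FEATURES))
    for i, manner in enumerate(MANNERS):
        gate[i, GROUPS["manner"]] = 1.0  # manner outputs are never gated off
        for group in VALID[manner]:
            gate[i, GROUPS[group]] = 1.0
    return gate


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_predictions(logits, gate):
    """Apply a soft manner-conditioned gate to per-frame logits.

    logits: array of shape (frames, 22), e.g. from a linear head on top of
    self-supervised speech features. Returns gated probabilities, same shape.
    """
    manner_probs = softmax(logits[:, GROUPS["manner"]])  # (frames, 7)
    soft_gate = manner_probs @ gate                      # (frames, 22)
    probs = sigmoid(logits)
    probs[:, GROUPS["manner"]] = manner_probs
    return probs * soft_gate
```

In this sketch the gate is a probability-weighted mixture of per-manner masks, so a frame confidently predicted as a vowel has its place outputs pushed toward zero regardless of the raw place logits. A hard variant would instead take the most likely manner and apply its mask row directly.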
Problem

Research questions and friction points this paper is trying to address.

phonological features
multilingual speech recognition
self-supervised speech models
feature coherence
frame-level recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

phonological features
self-supervised speech models
multilingual speech recognition
manner-conditioned gating
frame-level prediction
Abner Hernandez
Pattern Recognition Lab, Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg, Germany
TomĂĄs Arias-Vergara
Pattern Recognition Lab, Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg, Germany; GITA Lab, Facultad de IngenierĂ­a, Universidad de Antioquia UdeA, MedellĂ­n, Colombia
Daiqi Liu
Pattern Recognition Lab, Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg, Germany
Andreas Maier
Pattern Recognition Lab, Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg, Germany
Paula Andrea Pérez-Toro
Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg; Universidad de Antioquia
Machine Learning, Speech Analysis, Gait Analysis, Natural Language Processing, Deep Learning