AI Summary
This work proposes PhonoQ-2.0, the first end-to-end multilingual frame-level phonological feature recognition system. It directly predicts a structured 22-dimensional phonological feature vector (manner and place of articulation, voicing, and vowel quality) for each speech frame, without relying on phoneme outputs. To keep predictions linguistically plausible, the model incorporates a manner-of-articulation conditional gating mechanism that enforces consistency among the predicted features. By combining self-supervised speech representations, frame-level multi-label classification, and conditional gating, PhonoQ-2.0 achieves macro-F1 scores of 91.3% in-domain and 88.9% out-of-domain. It outperforms a CTC-based phoneme baseline by an average of 8.7 F1 points and yields gains of up to 10.8 points on unseen languages.
Abstract
Phonological features provide a language-general and linguistically grounded representation of speech. We present PhonoQ-2.0, a multilingual frame-level phonological feature recognizer built on self-supervised speech models. The system directly predicts a structured 22-dimensional feature vector per frame encoding manner, vowel quality, place, and voicing, instead of deriving features from phoneme outputs. To ensure phonologically coherent predictions, we introduce a manner-conditioned gating mechanism that activates valid feature groups. Evaluated across multiple languages and corpora, PhonoQ-2.0 achieves an average macro-F1 of 91.3% in-domain and 88.9% out-of-domain. Compared to a strong CTC phoneme baseline, it delivers consistent gains of +8.8 F1 in-domain and +8.6 out-of-domain on average. In unseen-language evaluation, PhonoQ-2.0 improves macro-F1 from 66.9% to 73.6% (+6.7 on average), with gains of up to +10.8 points.
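The manner-conditioned gating described above can be illustrated with a minimal sketch. This is not the PhonoQ-2.0 implementation: the abstract does not specify how the 22 dimensions are split, which manner classes exist, or whether gating is hard or soft. The group layout, the class indices, the soft multiplicative gate, and the name `gate_frame` are all assumptions made for illustration.

```python
import math

# Hypothetical layout of the 22-dimensional vector (the source does not give the split).
MANNER = slice(0, 8)     # assumed classes: silence, vowel, then six consonant manners
PLACE = slice(8, 15)     # assumed 7 place-of-articulation features
VOWEL = slice(15, 21)    # assumed 6 vowel-quality features
VOICING = slice(21, 22)  # assumed 1 voicing feature

SILENCE_IDX, VOWEL_IDX = 0, 1  # assumed positions inside the manner group


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gate_frame(logits):
    """Apply a soft manner-conditioned gate to one frame of 22 raw logits.

    Each dependent feature group is scaled by the probability mass of the
    manner classes for which that group is phonologically meaningful.
    """
    assert len(logits) == 22
    manner = softmax(logits[MANNER])
    p_vowel = manner[VOWEL_IDX]
    p_consonant = sum(manner[2:])         # all assumed consonant manners
    p_speech = 1.0 - manner[SILENCE_IDX]  # voicing is undefined in silence

    place = [sigmoid(z) * p_consonant for z in logits[PLACE]]
    vowel = [sigmoid(z) * p_vowel for z in logits[VOWEL]]
    voicing = [sigmoid(z) * p_speech for z in logits[VOICING]]
    return manner + place + vowel + voicing


# Usage: a frame whose manner logits strongly favour "vowel".
frame = [-10.0, 10.0] + [-10.0] * 6 + [2.0] * 7 + [2.0] * 6 + [2.0]
gated = gate_frame(frame)
```

In this sketch the place features of a confidently vocalic frame are pushed towards zero even though their raw logits are positive, while the vowel-quality and voicing features pass through almost unchanged. A hard gate (masking groups based on the argmax manner) would be an equally plausible reading of the abstract.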