Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses three key challenges in automatic pronunciation assessment (APA) with large multimodal models (LMMs): (1) uncertain fine-grained modeling capability, particularly at the phoneme level; (2) inconsistent evaluation outcomes between the Pearson and Spearman correlation coefficients; and (3) the absence of a systematic analysis of LMM performance across assessment granularities. To tackle these, we propose a supervised fine-tuning framework for LMMs that jointly processes speech and text as dual-modal inputs. Trained on the Speechocean762 benchmark and a proprietary dataset, our model achieves word- and sentence-level scoring accuracy with Pearson correlation coefficients (PCC) up to 0.9, comparable to leading commercial systems. We further conduct the first systematic analysis of LMM performance across granularity levels (phoneme, word, sentence), revealing significant degradation at finer scales. Crucially, we empirically demonstrate that the Spearman correlation coefficient (SCC) is more robust and appropriate than PCC as the primary metric for APA evaluation. Our work provides both methodological guidance and empirical validation for leveraging multimodal models in fine-grained pronunciation assessment.

📝 Abstract
Automatic Pronunciation Assessment (APA) is critical for Computer-Assisted Language Learning (CALL), requiring evaluation across multiple granularities and aspects. Large Multimodal Models (LMMs) present new opportunities for APA, but their effectiveness in fine-grained assessment remains uncertain. This work investigates fine-tuning LMMs for APA using the Speechocean762 dataset and a private corpus. Fine-tuning significantly outperforms zero-shot settings and achieves competitive results on single-granularity tasks compared to public and commercial systems. The model performs well at word and sentence levels, while phoneme-level assessment remains challenging. We also observe that the Pearson Correlation Coefficient (PCC) reaches 0.9, whereas Spearman's rank Correlation Coefficient (SCC) remains around 0.6, suggesting that SCC better reflects ordinal consistency. These findings highlight both the promise and limitations of LMMs for APA and point to future work on fine-grained modeling and rank-aware evaluation.
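The abstract's PCC-versus-SCC observation can be made concrete with a small sketch. The numbers below are invented for illustration (not from the paper): PCC measures linear agreement and can stay high when a few clearly-good or clearly-bad utterances dominate the fit, while SCC compares only the rankings, so it drops when the model scrambles the ordering of mid-range scores.

```python
# Toy comparison of Pearson (PCC) and Spearman (SCC) correlation.
# All scores here are hypothetical; implemented with the stdlib only.
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation = Pearson on the ranks (no ties here)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    return pearson(ranks(x), ranks(y))

# Hypothetical human scores vs. model predictions for six utterances:
# the model scrambles the ordering of the mid-range items, while one
# clearly-excellent utterance anchors the linear fit.
human = [2, 3, 4, 5, 6, 10]
model = [4, 2, 3, 6, 5, 10]

print(f"PCC = {pearson(human, model):.3f}")   # high despite rank errors
print(f"SCC = {spearman(human, model):.3f}")  # lower: ordering disagrees
```

On this toy data PCC comes out noticeably higher than SCC even though the model misorders most of the utterances, which mirrors the paper's argument that a rank-based metric better reflects ordinal consistency for APA.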
Problem

Research questions and friction points this paper is trying to address.

Fine-tuning multimodal models for pronunciation assessment
Evaluating performance at the phoneme, word, and sentence granularities
Investigating which correlation metric best captures ordinal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned Large Multimodal Models for pronunciation assessment
Outperforms zero-shot settings on multiple granularities
Achieves Pearson correlation up to 0.9 and analyzes ordinal consistency via Spearman correlation
Ke Wang
Microsoft, Beijing, China
Wenning Wei
Microsoft, Beijing, China
Yan Deng
Microsoft, Beijing, China
Lei He
Microsoft, Beijing, China
Sheng Zhao
Microsoft