🤖 AI Summary
To address performance degradation in context-aware automatic speech recognition (ASR) caused by variable-length biasing lists, this paper proposes a multi-granularity semantic relevance joint modeling framework. It simultaneously models semantic associations between biasing prompts and speech at the list level, phrase level, and token level, and introduces a grouped competitive purification mechanism that dynamically selects the most discriminative prompt subset rather than using all prompts. Cross-attention enables cross-granularity relevance computation and information fusion, substantially improving robustness to variable-length biasing inputs. On AISHELL-1 and KeSpeech, the method achieves relative F1-score improvements of up to 21.34% and 28.46% over baseline systems, with consistent gains across diverse prompt-list lengths. The core contributions are (i) a three-level joint semantic modeling architecture and (ii) a competitive prompt purification strategy, together establishing a scalable, robust paradigm for prompt utilization in context-aware ASR.
📝 Abstract
Recently, cross-attention-based contextual automatic speech recognition (ASR) models have made notable advances in recognizing personalized biasing phrases. However, the effectiveness of cross-attention is affected by variations in the volume of biasing information, especially when the biasing list grows significantly. We find that, regardless of the length of the biasing list, only a limited amount of biasing information is most relevant to a given ASR intermediate representation. Therefore, by identifying and integrating the most relevant biasing information rather than the entire biasing list, we can alleviate the effects of variations in biasing information volume for contextual ASR. To this end, we propose a purified semantic correlation joint modeling (PSC-Joint) approach. In PSC-Joint, we define and calculate three semantic correlations between ASR intermediate representations and biasing information, from coarse to fine: list-level, phrase-level, and token-level. The three correlations are then jointly modeled to produce their intersection, so that the biasing information most relevant across granularities is highlighted and integrated for contextual recognition. In addition, to reduce the computational cost introduced by jointly modeling three semantic correlations, we propose a purification mechanism based on a grouped-and-competitive strategy to filter out irrelevant biasing phrases. Compared with baselines, our PSC-Joint approach achieves average relative F1-score improvements of up to 21.34% on AISHELL-1 and 28.46% on KeSpeech, across biasing lists of varying lengths.
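The grouped-and-competitive purification described above can be pictured with a minimal sketch. This is not the paper's implementation; the function name, the dot-product relevance score, and the top-1-per-group rule are illustrative assumptions. It shows the key property the abstract claims: the number of surviving phrases is fixed by the group count, not by the length of the biasing list.

```python
import numpy as np

def purify_and_fuse(asr_repr, phrase_embs, num_groups=4):
    """Illustrative sketch (hypothetical, not the paper's code):
    split biasing-phrase embeddings into groups, keep the single most
    relevant phrase per group (competitive purification), then fuse
    the survivors into a context vector via attention weights."""
    d = asr_repr.shape[-1]
    # Phrase-level relevance: scaled dot product between the ASR
    # intermediate representation and each biasing-phrase embedding.
    scores = phrase_embs @ asr_repr / np.sqrt(d)          # shape (N,)
    # Grouped competition: partition the list into num_groups groups and
    # keep only each group's top scorer, so the purified subset has a
    # fixed size regardless of the original list length N.
    groups = np.array_split(np.arange(len(phrase_embs)), num_groups)
    winners = [g[np.argmax(scores[g])] for g in groups if len(g)]
    kept = phrase_embs[winners]                           # shape (G, d)
    # Attention over the purified subset only (softmax of kept scores).
    logits = scores[winners]
    attn = np.exp(logits - logits.max())
    attn /= attn.sum()
    context = attn @ kept                                 # shape (d,)
    return context, winners
```

Because only `num_groups` phrases survive to the fusion step, the cost of attending over biasing information stays constant as the list grows, which is the intuition behind the reported robustness across list lengths.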