Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-Text System

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Speech large language models (SLMs) suffer from low recall and insufficient accuracy when recognizing or translating domain-specific terms and neologisms. To address this, we propose a lightweight, efficient terminology intervention method that (1) models speech-term alignment via cross-modal cross-attention and, as its key novelty, explicitly converts attention weights into term presence probabilities; (2) applies curriculum learning to improve term retrieval; and (3) releases the first bilingual speech dataset with fine-grained term annotations. Our approach requires no SLM fine-tuning and operates purely as a post-hoc intervention. Experiments demonstrate 92.57% and 86.83% term recall on Chinese and English tasks, respectively, with only 8.71 ms latency per query and a 6-17% improvement in term accuracy. This work significantly advances controllable term generation in speech-language modeling.

📝 Abstract
Recent advances in speech large language models (SLMs) have improved speech recognition and translation in general domains, but accurately generating domain-specific terms or neologisms remains challenging. To address this, we propose Attention2Probability: attention-driven terminology probability estimation for robust speech-to-text systems, which is lightweight, flexible, and accurate. Attention2Probability converts cross-attention weights between speech and terminology into presence probabilities, and it further employs curriculum learning to enhance retrieval accuracy. Furthermore, to tackle the lack of data for speech-to-text tasks with terminology intervention, we create and release a new speech dataset with terminology to support future research in this area. Experimental results show that Attention2Probability significantly outperforms the VectorDB method on our test set. Specifically, its maximum recall rates reach 92.57% for Chinese and 86.83% for English. This high recall is achieved with a latency of only 8.71 ms per query. Intervening in SLMs' recognition and translation tasks using Attention2Probability-retrieved terms improves terminology accuracy by 6-17%, while also revealing limitations in how current SLMs utilize retrieved terminology.
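The paper's exact mapping from cross-attention weights to presence probabilities is not reproduced on this page. The sketch below only illustrates the core idea under stated assumptions: attention rows for a candidate term's tokens are pooled over speech frames (max per token, then mean over tokens) and squashed through a temperature-scaled sigmoid. The function name, the pooling scheme, and the `tau` threshold/temperature are all hypothetical, not the authors' implementation.

```python
import numpy as np

def term_presence_probability(attn, tau=0.1):
    """Hypothetical sketch: collapse a cross-attention map between a
    candidate term's tokens (rows) and speech frames (columns) into a
    single presence probability.

    attn: array of shape (term_tokens, speech_frames), where each row
    is a softmax distribution over frames. tau is an assumed
    temperature controlling how sharply the score saturates.
    """
    per_token_peak = attn.max(axis=1)   # strongest frame alignment per term token
    score = per_token_peak.mean()       # term is "present" only if all tokens align
    return 1.0 / (1.0 + np.exp(-(score - 0.5) / tau))

# Sharply peaked attention (each token locks onto one frame): term likely present.
peaked = np.zeros((3, 50))
peaked[[0, 1, 2], [10, 11, 12]] = 1.0
# Uniform attention (no alignment anywhere): term likely absent.
diffuse = np.full((3, 50), 1.0 / 50)

print(round(term_presence_probability(peaked), 3))   # 0.993
print(round(term_presence_probability(diffuse), 3))  # 0.008
```

The point of such a conversion is that the probability gives a calibrated accept/reject signal per candidate term, which is what lets the retrieved terms be injected into the SLM without fine-tuning it.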
Problem

Research questions and friction points this paper is trying to address.

Estimating terminology presence probabilities using attention weights
Improving domain-specific term recognition in speech systems
Addressing data scarcity for speech-to-text with terminology
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-driven terminology probability estimation
Curriculum learning enhances retrieval accuracy
Converts cross-attention weights into presence probabilities
👥 Authors
Yanfan Du (School of Computer Science and Engineering, Northeastern University, Shenyang, China)
Jun Zhang (ByteDance)
Bin Wang (ByteDance)
Jin Qiu (ByteDance)
Lu Huang (ByteDance Inc) · Speech Recognition, Acoustic Modeling, Deep Learning
Yuan Ge (Northeastern University, China) · Reasoning, Multimodality LLMs
Xiaoqian Liu (School of Computer Science and Engineering, Northeastern University, Shenyang, China)
Tong Xiao (School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China)
Jingbo Zhu (Northeastern University, China) · Machine Translation, Language Parsing, Natural Language Processing