Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-Text System

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Speech large language models (SLMs) suffer from low recall and insufficient accuracy when recognizing or translating domain-specific terms and neologisms. To address this, we propose a lightweight, efficient terminology intervention method that (1) models speech-term alignment via cross-modal cross-attention and, as its key novelty, explicitly converts attention weights into term presence probabilities; (2) applies curriculum learning to improve term retrieval; and (3) releases the first bilingual speech dataset with fine-grained term annotations. Our approach requires no SLM fine-tuning and operates purely as a post-hoc intervention. Experiments demonstrate 92.57% and 86.83% term recall on Chinese and English tasks, respectively, with only 8.71 ms latency per query and a 6-17% improvement in term accuracy. This work significantly advances controllable term generation in speech-language modeling.

📝 Abstract
Recent advances in speech large language models (SLMs) have improved speech recognition and translation in general domains, but accurately generating domain-specific terms or neologisms remains challenging. To address this, we propose Attention2Probability: attention-driven terminology probability estimation for robust speech-to-text systems, which is lightweight, flexible, and accurate. Attention2Probability converts cross-attention weights between speech and terminology into presence probabilities, and it further employs curriculum learning to enhance retrieval accuracy. Furthermore, to tackle the lack of data for speech-to-text tasks with terminology intervention, we create and release a new speech dataset with terminology to support future research in this area. Experimental results show that Attention2Probability significantly outperforms the VectorDB method on our test set. Specifically, its maximum recall rates reach 92.57% for Chinese and 86.83% for English. This high recall is achieved with a latency of only 8.71 ms per query. Intervening in SLMs' recognition and translation tasks using Attention2Probability-retrieved terms improves terminology accuracy by 6-17%, while also revealing limitations in how current SLMs utilize retrieved terminology.
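The paper's exact mapping from cross-attention weights to presence probabilities is not reproduced on this page. The sketch below only illustrates the core idea under stated assumptions: attention rows for a candidate term's tokens are pooled over speech frames (max per token, then mean over tokens) and squashed through a temperature-scaled sigmoid. The function name, the pooling scheme, and the `tau` threshold/temperature are all hypothetical, not the authors' implementation.

```python
import numpy as np

def term_presence_probability(attn, tau=0.1):
    """Hypothetical sketch: collapse a cross-attention map between a
    candidate term's tokens (rows) and speech frames (columns) into a
    single presence probability.

    attn: array of shape (term_tokens, speech_frames), where each row
    is a softmax distribution over frames. tau is an assumed
    temperature controlling how sharply the score saturates.
    """
    per_token_peak = attn.max(axis=1)   # strongest frame alignment per term token
    score = per_token_peak.mean()       # term is "present" only if all tokens align
    return 1.0 / (1.0 + np.exp(-(score - 0.5) / tau))

# Sharply peaked attention (each token locks onto one frame): term likely present.
peaked = np.zeros((3, 50))
peaked[[0, 1, 2], [10, 11, 12]] = 1.0
# Uniform attention (no alignment anywhere): term likely absent.
diffuse = np.full((3, 50), 1.0 / 50)

print(round(term_presence_probability(peaked), 3))   # 0.993
print(round(term_presence_probability(diffuse), 3))  # 0.008
```

The point of such a conversion is that the probability gives a calibrated accept/reject signal per candidate term, which is what lets the retrieved terms be injected into the SLM without fine-tuning it.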
Problem

Research questions and friction points this paper is trying to address.

Estimating terminology presence probabilities using attention weights
Improving domain-specific term recognition in speech systems
Addressing data scarcity for speech-to-text with terminology
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-driven terminology probability estimation
Curriculum learning enhances retrieval accuracy
Converts cross-attention weights into presence probabilities
👥 Authors
Yanfan Du (School of Computer Science and Engineering, Northeastern University, Shenyang, China)
Jun Zhang (ByteDance)
Bin Wang (ByteDance)
Jin Qiu (ByteDance)
Lu Huang (ByteDance Inc) · Speech Recognition, Acoustic Modeling, Deep Learning
Yuan Ge (Northeastern University, China) · Reasoning, Multimodality LLMs
Xiaoqian Liu (School of Computer Science and Engineering, Northeastern University, Shenyang, China)
Tong Xiao (School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China)
Jingbo Zhu (Northeastern University, China) · Machine Translation, Language Parsing, Natural Language Processing