Two-Stage Acoustic Adaptation with Gated Cross-Attention Adapters for LLM-Based Multi-Talker Speech Recognition

📅 2026-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the significant performance degradation of large language models (LLMs) in complex speech recognition scenarios involving three or more overlapping speakers. The degradation stems primarily from insufficient acoustic integration when relying solely on prefix-based injection, which leads to information loss and misalignment. To overcome this limitation, the authors propose a two-stage acoustic adaptation framework: first, a gated residual cross-attention adapter explicitly injects speaker-aware acoustic embeddings into the decoder as external memory; second, low-rank adaptation (LoRA) refines both the adapters and the LLM's self-attention projections to improve robustness under data-scarce conditions. This approach enables parameter-efficient, fine-grained acoustic fusion and achieves substantial word-error-rate improvements on the Libri2Mix and Libri3Mix datasets under clean and noisy conditions, with particularly notable gains in three-speaker overlap scenarios.
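To make the adapter idea concrete, here is a minimal NumPy sketch of a gated residual cross-attention layer of the kind the summary describes. All shapes, the single-head formulation, and the `tanh` gate are assumptions for illustration; in the paper the adapter sits after the self-attention sub-layer of an LLM decoder and attends over acoustic embeddings as external memory.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attn_adapter(x, memory, Wq, Wk, Wv, gate):
    """Gated residual cross-attention (single head, illustrative).

    x:      (T, d) decoder hidden states (queries)
    memory: (S, d) acoustic embeddings (keys/values, external memory)
    gate:   scalar gating parameter; tanh(0) = 0, so at initialization
            the adapter is an identity map and injection ramps up stably.
    """
    q, k, v = x @ Wq, memory @ Wk, memory @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    return x + np.tanh(gate) * attn  # gated residual connection

rng = np.random.default_rng(0)
d, T, S = 8, 4, 6
x = rng.standard_normal((T, d))
mem = rng.standard_normal((S, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

y_init = gated_cross_attn_adapter(x, mem, Wq, Wk, Wv, gate=0.0)  # identity at init
y_open = gated_cross_attn_adapter(x, mem, Wq, Wk, Wv, gate=1.0)  # acoustic evidence injected
```

Initializing the gate at zero is a common trick for inserting new cross-attention into a pretrained decoder without disturbing its behavior; whether the paper uses exactly this parameterization is not stated here.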
📝 Abstract
Large Language Models (LLMs) are strong decoders for Serialized Output Training (SOT) in two-talker Automatic Speech Recognition (ASR), yet their performance degrades substantially in challenging conditions such as three-talker mixtures. A key limitation is that current systems inject acoustic evidence only through a projected prefix, which can be lossy and imperfectly aligned with the LLM input space, providing insufficient fine-grained grounding during decoding. Addressing this limitation is crucial for robust multi-talker ASR, especially in three-talker mixtures. This paper improves LLM-based multi-talker ASR by explicitly injecting talker-aware acoustic evidence into the decoder. We first revisit Connectionist Temporal Classification (CTC)-derived prefix prompting and compare three variants with increasing acoustic content; the CTC information is obtained using the serialized CTC proposed in our previous work. While acoustic-enriched prompts outperform the SOT-only baseline, prefix-only conditioning remains inadequate for three-talker mixtures. We therefore propose a lightweight gated residual cross-attention adapter and design a two-stage acoustic adaptation framework based on low-rank updates (LoRA). In Stage 1, we insert gated cross-attention adapters after the self-attention sub-layer to stably inject acoustic embeddings as external memory. In Stage 2, we refine both the cross-attention adapters and the pretrained LLM's self-attention projections using parameter-efficient LoRA, improving robustness for large backbones under limited data; the learned updates are merged into the base weights for inference. Experiments on Libri2Mix/Libri3Mix under clean and noisy conditions show consistent gains, with particularly large improvements in three-talker settings.
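The Stage 2 claim that "learned updates are merged into the base weights for inference" follows from the standard LoRA identity W' = W + (alpha/r) * B A. A short NumPy sketch of that merge (dimensions and the scaling convention are assumptions; the paper applies this to the adapters and self-attention projections of a real LLM):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, alpha = 16, 4, 8          # hidden size, LoRA rank, scaling factor
s = alpha / r

W = rng.standard_normal((d, d))            # frozen pretrained projection
A = rng.standard_normal((r, d)) * 0.01     # low-rank factors (here: post-"training"
B = rng.standard_normal((d, r)) * 0.01     # random values standing in for learned ones)
x = rng.standard_normal((3, d))            # a batch of hidden states

# Training-time forward pass: base path plus low-rank bypass.
y_lora = x @ W.T + s * (x @ A.T) @ B.T

# Inference-time merge: fold the update into the base matrix once,
# so decoding pays no extra cost for the adaptation.
W_merged = W + s * (B @ A)
y_merged = x @ W_merged.T
```

The two outputs are identical because (B A)^T = A^T B^T, so merging is exact rather than an approximation; only r*(2d) extra parameters per matrix are trained, which is why LoRA suits large backbones under limited data.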
Problem

Research questions and friction points this paper is trying to address.

multi-talker speech recognition
acoustic adaptation
large language models
three-talker mixtures
acoustic grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

gated cross-attention adapter
two-stage acoustic adaptation
LoRA
multi-talker ASR
acoustic grounding