🤖 AI Summary
This work addresses the challenge that speech large language models (Speech LLMs) struggle to accurately localize hotwords and long-tail named entities under weak supervision because of strong language-model priors. To this end, we propose CLAR, a dual-encoder speech-text retrieval framework that, for the first time, leverages the Continuous Integrate-and-Fire (CIF) mechanism to achieve timestamp-free monotonic alignment at the token level without alignment supervision. CLAR further incorporates length-aware local matching to strengthen acoustic cues for short entities. Through multi-granularity contrastive learning and a CIF-based quantity constraint, our approach effectively mitigates representation dilution and attention drift. Experimental results demonstrate that CLAR significantly improves hotword retrieval accuracy and substantially reduces both character error rate (CER) and biased word error rate (B-WER) over strong baselines.
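The CIF mechanism mentioned above can be illustrated with a minimal sketch: per-frame firing weights are accumulated frame by frame, and a token embedding is emitted each time the accumulator crosses a threshold, which is what yields a monotonic, timestamp-free token-level alignment. The function below is a hypothetical simplification (not the authors' code); it discards any leftover weight below the threshold at the end of the utterance.

```python
import numpy as np

def cif_integrate_and_fire(frames, alpha, threshold=1.0):
    """Toy Continuous Integrate-and-Fire (CIF) step.

    frames: (T, D) per-frame acoustic features.
    alpha:  (T,) non-negative firing weights predicted per frame.
    Emits one weighted-sum token embedding each time the accumulated
    weight crosses `threshold`; residual weight below the threshold
    at the end is dropped (a simplification).
    """
    tokens = []
    acc_w = 0.0                                  # accumulated weight
    acc_state = np.zeros(frames.shape[1])        # accumulated embedding
    for h, a in zip(frames, alpha):
        if acc_w + a < threshold:
            acc_w += a
            acc_state = acc_state + a * h
        else:
            spill = acc_w + a - threshold        # weight past the boundary
            used = a - spill                     # portion finishing this token
            tokens.append(acc_state + used * h)
            acc_w = spill                        # spill starts the next token
            acc_state = spill * h
    return np.array(tokens)
```

Because firings are tied to cumulative weight over time, the emitted token sequence is monotonic in the frame axis by construction, which is the property CLAR exploits for localization.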
📝 Abstract
Speech LLM-based ASR often struggles with named entities and long-tail words due to strong internal language-model priors. Retrieval-augmented biasing can help, but its effectiveness depends on accurately localizing hotwords in full-utterance speech under weak supervision. We propose CLAR, a dual-encoder speech-text retriever that uses Continuous Integrate-and-Fire (CIF) to learn monotonic token-level alignments without timestamps. With length-aware localized matching, CLAR anchors short-entity acoustic cues and reduces representation dilution and attention drift. The retriever is trained with a multi-granularity objective combining global and local segment-level contrastive losses with a CIF quantity constraint. At inference, top-ranked hotwords are injected as contextual prompts for the Speech LLM, improving recognition without shallow fusion. Experiments show that CLAR significantly improves hotword retrieval and reduces both character error rate (CER) and biased word error rate (B-WER) relative to strong contextual ASR baselines.
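The multi-granularity objective described in the abstract can be sketched as a sum of a contrastive term and a quantity term. The snippet below is an illustrative approximation under stated assumptions, not the paper's implementation: it keeps only the global speech-text InfoNCE loss (omitting the local segment-level terms and length-aware matching) and models the CIF quantity constraint as an L1 penalty tying the summed firing weights to the target token count. All names (`multi_granularity_loss`, `lambda_qua`) are hypothetical.

```python
import numpy as np

def _logsumexp(x, axis):
    # Numerically stable log-sum-exp for the softmax normalizer.
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def multi_granularity_loss(speech_emb, text_emb, cif_alpha, token_counts,
                           temperature=0.07, lambda_qua=1.0):
    """Simplified CLAR-style objective (hypothetical helper).

    speech_emb, text_emb: (B, D) paired utterance/hotword embeddings.
    cif_alpha:            (B, T) per-frame CIF firing weights.
    token_counts:         (B,) target token counts per utterance.
    """
    # Global contrastive term: matched pairs sit on the diagonal
    # of the cosine-similarity matrix (symmetric InfoNCE).
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature                      # (B, B)
    log_p_s2t = logits - _logsumexp(logits, axis=1)       # speech -> text
    log_p_t2s = logits.T - _logsumexp(logits.T, axis=1)   # text -> speech
    l_con = -0.5 * (np.diag(log_p_s2t).mean() + np.diag(log_p_t2s).mean())
    # CIF quantity constraint: accumulated firing weights should
    # sum to the number of target tokens in each utterance.
    l_qua = np.abs(cif_alpha.sum(axis=1) - token_counts).mean()
    return l_con + lambda_qua * l_qua
```

The quantity term is what couples retrieval training to alignment: it pushes the CIF weights to fire once per token, so the same weights that serve recognition-style alignment also localize hotword segments for the local matching losses.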