End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering

๐Ÿ“… 2025-11-12
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address the low accuracy and poor robustness of speech retrieval in long-audio spoken question answering, this paper proposes CLSR, an end-to-end contrastive languageโ€“speech retrieval model. CLSR introduces a learnable, text-like intermediate representation to explicitly align acoustic features with linguistic semantics, thereby circumventing error propagation inherent in cascaded approaches and mitigating the excessive modality gap in conventional cross-modal contrastive learning. By jointly optimizing acoustic and textual encoders within a unified contrastive learning framework, CLSR enables fine-grained semantic matching between speech segments and textual questions. Evaluated on four cross-modal retrieval benchmarks, CLSR significantly outperforms existing speech retrievers and cascaded baselines, achieving substantial improvements in both accuracy and generalization capability for long-audio spoken question answering.
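The summary above describes jointly optimizing acoustic and textual encoders under a contrastive objective so that matching speech segments and questions score highly against in-batch negatives. A minimal sketch of a symmetric InfoNCE-style loss in plain Python follows; the function names, the temperature value, and the toy 2-dimensional embeddings are illustrative assumptions, not the paper's implementation:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(speech_embs, text_embs, temperature=0.07):
    # Symmetric InfoNCE: the speech embedding at index i should match
    # the text embedding at index i; all other pairs act as negatives.
    n = len(speech_embs)
    sims = [[cosine(s, t) / temperature for t in text_embs]
            for s in speech_embs]
    loss = 0.0
    for i in range(n):
        row = sims[i]                          # speech -> text direction
        col = [sims[j][i] for j in range(n)]   # text -> speech direction
        loss += -math.log(math.exp(row[i]) / sum(math.exp(x) for x in row))
        loss += -math.log(math.exp(col[i]) / sum(math.exp(x) for x in col))
    return loss / (2 * n)
```

Driving this loss down pulls each speech segment toward its paired question text and pushes it away from the other texts in the batch, which is the alignment behavior the summary attributes to CLSR.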

๐Ÿ“ Abstract
Significant progress has been made in spoken question answering (SQA) in recent years. However, many existing methods, including large audio language models, struggle to process long audio. Following the success of retrieval-augmented generation, speech-related retrievers show promise for preprocessing long-form speech, but the performance of existing speech-related retrievers is lacking. To address this challenge, we propose CLSR, an end-to-end contrastive language-speech retriever that efficiently extracts question-relevant segments from long audio recordings for the downstream SQA task. Unlike conventional speech-text contrastive models, CLSR incorporates an intermediate step that converts acoustic features into text-like representations prior to alignment, thereby more effectively bridging the gap between modalities. Experimental results across four cross-modal retrieval datasets demonstrate that CLSR surpasses both end-to-end speech-related retrievers and pipeline approaches that combine speech recognition with text retrieval, providing a robust foundation for practical long-form SQA applications.
Problem

Research questions and friction points this paper is trying to address.

Extracting question-relevant segments from long audio recordings for spoken question answering
Improving speech retrieval performance for processing long-form spoken content
Bridging modality gaps between speech and text in cross-modal retrieval systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end contrastive retriever for long audio
Converts acoustic features to text-like representations
Bridges modality gap for spoken question answering
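At inference time, the retriever described by these points scores candidate speech segments against a textual question and keeps the best matches for the downstream SQA model. A minimal sketch of that ranking step, assuming precomputed embeddings; `retrieve_segments` and the toy vectors are hypothetical, not taken from the paper:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_segments(question_emb, segment_embs, k=1):
    # Rank speech-segment embeddings by similarity to the question
    # embedding and return the indices of the top-k segments.
    ranked = sorted(range(len(segment_embs)),
                    key=lambda i: cosine(question_emb, segment_embs[i]),
                    reverse=True)
    return ranked[:k]
```

Only the top-k segments are then passed to the answer generator, which is what lets a long recording be handled without feeding the entire audio to the SQA model.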
Jiliang Hu
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China
Zuchao Li
Wuhan University
Natural Language Processing · Machine Learning
Baoyuan Qi
Xiaomi, Beijing, China
Guoming Liu
Xiaomi, Beijing, China
Ping Wang
School of Information Management, Wuhan University, Wuhan, China