WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models

📅 2025-02-20
🤖 AI Summary
Existing RAG frameworks rely on automatic speech recognition (ASR) to convert speech into text, incurring audio information loss, transcription errors, and high computational overhead. This paper introduces the first end-to-end native audio RAG framework that operates directly on raw waveforms, bypassing ASR entirely. Our approach comprises two core innovations: (1) WavRetriever, a cross-modal embedding alignment and joint retrieval module supporting heterogeneous knowledge bases comprising both text and audio; and (2) a chain-of-thought prompting mechanism tailored to enhance contextual modeling in spoken dialogues. Experiments demonstrate that our method matches the retrieval performance of ASR-based text baselines while accelerating inference by 10×. To our knowledge, this is the first work to achieve unified knowledge representation across modalities and enable end-to-end spoken language understanding and generation grounded in raw audio.

📝 Abstract
Retrieval Augmented Generation (RAG) has gained widespread adoption owing to its capacity to empower large language models (LLMs) to integrate external knowledge. However, existing RAG frameworks are primarily designed for text-based LLMs and rely on Automatic Speech Recognition (ASR) to process speech input, which discards crucial audio information, risks transcription errors, and increases computational overhead. Therefore, we introduce WavRAG, the first retrieval augmented generation framework with native, end-to-end audio support. WavRAG offers two key features: 1) Bypassing ASR, WavRAG directly processes raw audio for both embedding and retrieval. 2) WavRAG integrates audio and text into a unified knowledge representation. Specifically, we propose the WavRetriever to facilitate retrieval from a text-audio hybrid knowledge base, and further enhance the in-context capabilities of spoken dialogue models through the integration of chain-of-thought reasoning. In comparison to state-of-the-art ASR-Text RAG pipelines, WavRAG achieves comparable retrieval performance while delivering a 10× acceleration. Furthermore, WavRAG's unique text-audio hybrid retrieval capability extends the boundaries of RAG to the audio modality.
Problem

Research questions and friction points this paper is trying to address.

Enhance spoken dialogue models with audio support
Integrate audio and text for unified knowledge representation
Improve retrieval performance and computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Native end-to-end audio support
Direct raw audio processing
Text-audio hybrid knowledge retrieval
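The hybrid retrieval idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `embed` function below is a hypothetical stand-in for the WavRetriever encoder (which maps raw audio and text into one shared embedding space), using deterministic pseudo-embeddings purely so the example runs end to end. The key point it shows is that text snippets and audio clips compete in a single similarity ranking.

```python
import hashlib
import numpy as np

def embed(item: str, dim: int = 8) -> np.ndarray:
    """Stand-in encoder: in WavRAG, the WavRetriever would embed raw audio
    or text into a shared space; here a deterministic pseudo-embedding is
    derived from the item string so the demo is runnable and repeatable."""
    seed = int(hashlib.sha256(item.encode()).hexdigest()[:8], 16)
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)  # unit-normalize so dot product = cosine

def retrieve(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Score every entry by cosine similarity to the query embedding and
    return the k closest; text and audio entries share one ranking."""
    q = embed(query)
    ranked = sorted(knowledge_base,
                    key=lambda item: float(q @ embed(item)),
                    reverse=True)
    return ranked[:k]

# Hybrid knowledge base: text snippets and (hypothetical) audio clip IDs
# live in a single index because both embed into the same space.
kb = [
    "text: opening hours of the museum",
    "audio: visitor_question_012.wav",
    "text: ticket pricing policy",
]
print(retrieve("audio: user_query.wav", kb, k=2))
```

In a real system the pseudo-embeddings would be replaced by a trained cross-modal encoder, and the linear scan by an approximate nearest-neighbor index; the retrieval logic itself stays the same.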
Authors
Yifu Chen (Zhejiang University)
Shengpeng Ji (Zhejiang University)
Haoxiao Wang (Zhejiang University)
Ziqing Wang (Beijing University of Technology)
Siyu Chen (Zhejiang University)
Jinzheng He (Alibaba Qwen Team, Zhejiang University)
Jin Xu (Alibaba Group)
Zhou Zhao (Zhejiang University)