OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment

📅 2025-06-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing end-to-end speech generation methods suffer from two key limitations: (1) speech tokens generated outside the LLM's autoregressive process leave text generation unaware of concurrent speech synthesis, and (2) cross-modal speech-text alignment is coarse-grained and only weakly semantic. To address these, the paper proposes OmniDRCA, a foundation model for parallel speech-text joint autoregressive generation. Its core contributions are: (1) dual-resolution speech representations that combine frame-level precision with semantic-level abstraction, and (2) a contrastive, fine-grained cross-modal alignment mechanism that explicitly enforces semantic consistency between speech and text. On Spoken Question Answering benchmarks, OmniDRCA establishes new state-of-the-art performance among parallel joint speech-text foundation models and is competitive with interleaved models. The authors also explore extending the framework to full-duplex conversational scenarios for real-time, bidirectional interaction.

📝 Abstract
Recent studies on end-to-end speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing approaches primarily fall into two categories: (1) Methods that generate discrete speech tokens independently without incorporating them into the LLM's autoregressive process, resulting in text generation being unaware of concurrent speech synthesis. (2) Models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents OmniDRCA, a parallel speech-text foundation model based on joint autoregressive modeling, featuring dual-resolution speech representations and contrastive cross-modal alignment. Our approach processes speech and text representations in parallel while enhancing audio comprehension through contrastive alignment. Experimental results on Spoken Question Answering benchmarks demonstrate that OmniDRCA establishes new state-of-the-art (SOTA) performance among parallel joint speech-text modeling based foundation models, and achieves competitive performance compared to interleaved models. Additionally, we explore the potential of extending the framework to full-duplex conversational scenarios.
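The parallel joint scheme described in the abstract can be illustrated with a toy decoding loop: at every step the model consumes both token histories and emits one text token and one speech token, so each modality stays aware of the other. `toy_next_tokens` below is a hypothetical stand-in for a real LLM forward pass, not the paper's model; the token values are arbitrary.

```python
TEXT_EOS, SPEECH_EOS = -1, -1

def toy_next_tokens(text_hist, speech_hist):
    """Hypothetical model step: conditioned on BOTH histories, return
    (next_text_token, next_speech_token). Echoes deterministic toy
    tokens purely for illustration."""
    step = len(text_hist)
    if step >= 5:
        return TEXT_EOS, SPEECH_EOS
    return 100 + step, 200 + step

def generate(max_steps=16):
    """Parallel joint autoregressive decoding: both streams advance
    together, one token per modality per step."""
    text, speech = [], []
    for _ in range(max_steps):
        t_tok, s_tok = toy_next_tokens(text, speech)
        if t_tok == TEXT_EOS and s_tok == SPEECH_EOS:
            break
        text.append(t_tok)
        speech.append(s_tok)
    return text, speech

text_tokens, speech_tokens = generate()
print(text_tokens, speech_tokens)
```

Contrast this with the first category the abstract describes, where speech tokens are produced outside the loop and the text stream never sees them.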
Problem

Research questions and friction points this paper is trying to address.

Speech tokens generated outside the LLM's autoregressive process leave text generation unaware of concurrent speech synthesis
Cross-modal alignment between speech and text representations is coarse-grained and only weakly semantic
Extending joint speech-text generation to full-duplex conversational scenarios remains open
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint autoregressive modeling for speech-text
Dual-resolution speech representations
Contrastive cross-modal alignment
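The last two innovations can be sketched together: derive a coarser "semantic-level" view by pooling frame-level features, then align pooled speech embeddings with text embeddings via a symmetric InfoNCE loss, where matched pairs are positives and all other pairings in the batch are negatives. The pooling factor, embedding sizes, and the InfoNCE formulation are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

def pool_frames(frames, factor):
    """Average-pool frame-level features into coarser 'semantic-level'
    vectors -- one simple way to obtain a second, lower resolution."""
    T, D = frames.shape
    T = (T // factor) * factor
    return frames[:T].reshape(-1, factor, D).mean(axis=1)

def cross_entropy(logits, labels):
    """Row-wise softmax cross-entropy, numerically stabilized."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def contrastive_alignment_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched speech/text pairs (i, i) are positives,
    every other pairing in the batch is a negative."""
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature           # (B, B) similarity matrix
    labels = np.arange(logits.shape[0])
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))

# Toy batch: 4 utterances, each with 40 frame-level features of dim 16.
frame_feats = [rng.normal(size=(40, 16)) for _ in range(4)]
# Dual resolution: pool every 8 frames into a semantic vector, then average
# into one utterance-level embedding for the contrastive head.
speech_emb = np.stack([pool_frames(f, 8).mean(axis=0) for f in frame_feats])
# Matched text embeddings: correlated with the speech side (toy stand-in).
text_emb = speech_emb + 0.1 * rng.normal(size=speech_emb.shape)

loss_matched = contrastive_alignment_loss(speech_emb, text_emb)
loss_shuffled = contrastive_alignment_loss(speech_emb, text_emb[::-1].copy())
print(f"matched: {loss_matched:.3f}  shuffled: {loss_shuffled:.3f}")
```

Shuffling the text side breaks the pairing, so the loss rises sharply: this is the signal that pushes speech and text embeddings toward semantic consistency.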
Authors
Chao-Hong Tan (University of Science and Technology of China)
Qian Chen (Tongyi Lab, Alibaba Group)
Wen Wang (Tongyi Lab, Alibaba Group)
Chong Deng (Alibaba Group)
Qinglin Zhang (Tongyi Lab, Alibaba Group)
Luyao Cheng (Tongyi Lab, Alibaba Group)
Hai Yu (Nankai University)
Xin Zhang (School of Computer Science, Fudan University)
Xiang Lv (Tongyi Lab, Alibaba Group)
Tianyu Zhao (Tongyi Lab, Alibaba Group)
Chong Zhang (Tongyi Lab, Alibaba Group)
Yukun Ma (Alibaba Group)
Yafeng Chen (University of Science and Technology of China)
Hui Wang (Tongyi Lab, Alibaba Group)
Jiaqing Liu (Renmin University of China)
Jieping Ye (Tongyi Lab, Alibaba Group)