🤖 AI Summary
Existing end-to-end speech generation methods suffer from two key limitations: (1) autoregressive generation of text and speech tokens lacks synchronous modeling, and (2) cross-modal alignment is coarse-grained and weakly semantics-aware. To address these, we propose OmniDRCA, a parallel speech-text foundation model based on joint autoregressive generation. Its core innovations are: (1) dual-resolution speech representations that integrate frame-level precision with semantic-level abstraction, and (2) a contrastive-learning-driven fine-grained cross-modal alignment mechanism that explicitly enforces semantic consistency between speech and text. On Spoken Question Answering benchmarks, OmniDRCA establishes new state-of-the-art performance among parallel joint speech-text generation models and achieves competitive results against interleaved models. The framework also shows potential for extension to full-duplex conversational scenarios.
📝 Abstract
Recent studies on end-to-end speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing approaches primarily fall into two categories: (1) methods that generate discrete speech tokens independently without incorporating them into the LLM's autoregressive process, so that text generation is unaware of concurrent speech synthesis; and (2) models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents OmniDRCA, a parallel speech-text foundation model based on joint autoregressive modeling, featuring dual-resolution speech representations and contrastive cross-modal alignment. Our approach processes speech and text representations in parallel while enhancing audio comprehension through contrastive alignment. Experimental results on Spoken Question Answering benchmarks demonstrate that OmniDRCA establishes new state-of-the-art (SOTA) performance among foundation models based on parallel joint speech-text modeling, and achieves competitive performance compared to interleaved models. Additionally, we explore the potential of extending the framework to full-duplex conversational scenarios.
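The contrastive cross-modal alignment described above can be illustrated with a minimal InfoNCE-style sketch. This is an assumption-laden illustration, not OmniDRCA's actual objective: the paper's abstract only states that contrastive alignment is used, so the loss form, temperature, and embedding shapes below are illustrative choices. Paired speech and text embeddings (row i of each batch is a matched pair) are pulled together while mismatched pairs are pushed apart:

```python
import numpy as np

def contrastive_alignment_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    speech_emb, text_emb: (batch, dim) arrays; row i of each is a
    matched speech/text pair. The diagonal of the similarity matrix
    holds the positives; all off-diagonal entries are negatives.
    Note: the temperature value and symmetric form are assumptions,
    not taken from the OmniDRCA paper.
    """
    # L2-normalize so dot products become cosine similarities
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy_diag(l):
        # Log-softmax over each row, with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the speech-to-text and text-to-speech directions
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))
# Well-aligned speech embeddings (text plus small noise) vs. random ones
aligned_loss = contrastive_alignment_loss(text + 0.01 * rng.normal(size=(8, 16)), text)
random_loss = contrastive_alignment_loss(rng.normal(size=(8, 16)), text)
```

Minimizing such a loss drives matched speech-text pairs toward high similarity, which is the sense in which contrastive alignment enforces semantic consistency between the two modalities: `aligned_loss` comes out lower than `random_loss` above.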