AI Summary
This work addresses the throughput bottleneck that sequential generation imposes on autoregressive language models, as well as the performance degradation, high training cost, and lack of convergence guarantees associated with diffusion-based parallel generation. To reconcile these limitations, the authors propose a dual-view architecture that pairs a frozen large autoregressive model with a lightweight trainable diffusion module, both sharing the same high-fidelity KV cache. The autoregressive component pre-fills the context, while the diffusion module generates tokens in parallel; the two views are linked through an exact consensus mechanism that ensures lossless inference. The approach unifies the strengths of both generation paradigms within a single framework, achieving up to a 7.8× speedup with only O(1) additional cache overhead and a minimal parameter increase, while preserving the original model's output quality and providing theoretical convergence guarantees.
Abstract
We introduce Orthrus, a simple and efficient dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models. The sequential nature of standard autoregressive decoding represents a fundamental bottleneck for high-throughput inference. While diffusion language models attempt to break this barrier via parallel generation, they suffer from significant performance degradation, high training costs, and a lack of rigorous convergence guarantees. Orthrus resolves this dichotomy natively. Designed to seamlessly integrate into existing Transformers, the framework augments a frozen LLM with a lightweight, trainable module to create a parallel diffusion view alongside the standard autoregressive view. In this unified system, both views attend to the exact same high-fidelity Key-Value (KV) cache; the autoregressive head executes context pre-filling to construct accurate KV representations, while the diffusion head executes parallel generation. By employing an exact consensus mechanism between the two views, Orthrus guarantees lossless inference, delivering up to a 7.8x speedup with only an O(1) memory cache overhead and minimal parameter additions.
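The abstract does not spell out how the exact consensus mechanism works. As a rough illustration only, the sketch below assumes a greedy draft-and-verify scheme of the kind used in speculative decoding: a parallel drafter proposes a block of tokens, the autoregressive view checks them, and the longest agreeing prefix is kept along with one corrected token. The names `ar_next`, `draft_block`, and `decode_consensus` are hypothetical, and both "models" are toy integer functions rather than Orthrus components. The point it shows is that, under this assumption, the output matches plain autoregressive decoding token for token while needing fewer decoding rounds.

```python
from typing import List, Tuple

Token = int


def ar_next(prefix: List[Token]) -> Token:
    """Toy stand-in for the frozen autoregressive view (greedy next token)."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_block(prefix: List[Token], k: int) -> List[Token]:
    """Toy stand-in for the parallel (diffusion) view: proposes k tokens.

    It is deliberately wrong at some positions so that rejection is exercised.
    A real drafter would produce the block in one parallel pass; this toy
    simply reuses ar_next and perturbs it.
    """
    out: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = ar_next(ctx)
        if len(ctx) % 3 == 2:
            tok = (tok + 1) % 50  # injected disagreement (assumption)
        out.append(tok)
        ctx.append(tok)
    return out


def decode_ar(prompt: List[Token], n: int) -> List[Token]:
    """Reference: plain sequential greedy decoding, one token per step."""
    seq = list(prompt)
    while len(seq) - len(prompt) < n:
        seq.append(ar_next(seq))
    return seq[len(prompt):]


def decode_consensus(prompt: List[Token], n: int, k: int = 4) -> Tuple[List[Token], int]:
    """Draft k tokens, keep the prefix the autoregressive view agrees with.

    Returns the generated tokens and the number of draft/verify rounds.
    """
    seq = list(prompt)
    rounds = 0
    while len(seq) - len(prompt) < n:
        draft = draft_block(seq, k)
        rounds += 1
        for tok in draft:
            # In a real system this check would be one batched verification pass.
            target = ar_next(seq)
            seq.append(target)  # the autoregressive token is always the one kept
            if tok != target or len(seq) - len(prompt) >= n:
                break  # first disagreement (or length limit) ends the round
    return seq[len(prompt):], rounds


prompt = [1, 2]
reference = decode_ar(prompt, 12)
generated, rounds = decode_consensus(prompt, 12, k=4)
print(generated == reference, rounds)
```

Because every kept token is the one the autoregressive view would have chosen, a mismatch in the draft costs speed but never changes the output. How Orthrus realizes its consensus step, and how the shared KV cache is reused during verification, is not stated in the abstract and is not modeled here.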