Compressed and Smooth Latent Space for Text Diffusion Modeling

πŸ“… 2025-06-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Autoregressive language models suffer from sequential decoding bottlenecks and weak global coherence, while text diffusion models have progressed slowly because modeling high-dimensional discrete token spaces is difficult. To address this, the paper proposes Cosmos, the first diffusion-based text generation framework built on a learned compressed latent space. Cosmos jointly trains an autoencoder to achieve 8× sequence compression, preserving semantic fidelity through token-level reconstruction and alignment with frozen pretrained language model activations. Diffusion training is then conducted in the frozen encoder’s latent space with perturbation-augmented learning. Evaluated on story generation, question generation, summarization, and detoxification, Cosmos matches or surpasses both autoregressive and existing diffusion baselines in generation quality while accelerating inference by over 2×, achieving a favorable trade-off among generation quality, efficiency, and controllability.
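To make the autoencoder objective concrete, here is a minimal numpy sketch of the two training signals the summary describes: token-level reconstruction through an 8× sequence compression, plus alignment with frozen pretrained-LM activations. All names, dimensions, and the mean-pool/repeat encoder-decoder pair are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 64 tokens, hidden width 32, 8x compression.
seq_len, hidden, ratio = 64, 32, 8

# Stand-ins for token-level hidden states and frozen pretrained-LM activations.
token_states = rng.normal(size=(seq_len, hidden))
frozen_acts = rng.normal(size=(seq_len, hidden))

def encode(x, ratio):
    """Toy 8x sequence compression: mean-pool non-overlapping blocks of tokens."""
    return x.reshape(x.shape[0] // ratio, ratio, x.shape[1]).mean(axis=1)

def decode(z, ratio):
    """Toy decoder: repeat each latent to restore the original sequence length."""
    return np.repeat(z, ratio, axis=0)

z = encode(token_states, ratio)   # (8, 32) compressed latent sequence
recon = decode(z, ratio)          # (64, 32) token-level reconstruction

# Joint objective: reconstruct tokens AND stay close to frozen encoder activations.
recon_loss = float(np.mean((recon - token_states) ** 2))
align_loss = float(np.mean((recon - frozen_acts) ** 2))
loss = recon_loss + align_loss

print(z.shape, recon.shape, round(loss, 3))
```

The point of the alignment term is that the compressed latents cannot drift into an arbitrary code space: decoding them must also track the semantics captured by the frozen pretrained encoder.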

πŸ“ Abstract
Autoregressive language models dominate modern text generation, yet their sequential nature introduces fundamental limitations: decoding is slow, and maintaining global coherence remains challenging. Diffusion models offer a promising alternative by enabling parallel generation and flexible control; however, their application to text generation is hindered by the high dimensionality of token-level representations. We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. This space is learned using an autoencoder trained simultaneously for token-level reconstruction and alignment with frozen activations from a pretrained language encoder, providing robust semantic grounding and enabling effective perturbation-based augmentations. Empirically, we demonstrate that text representations can be compressed by $8\times$ while maintaining generation quality comparable to token-level diffusion models. Furthermore, increasing the latent sequence length allows Cosmos to surpass both diffusion-based and autoregressive baselines. We evaluate Cosmos on four diverse generative tasks, including story generation, question generation, summarization, and detoxification, and compare it with various generative paradigms. Cosmos achieves comparable or superior generation quality while offering more than $2\times$ faster inference.
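The "smooth latent space" and "perturbation-based augmentations" in the abstract can be sketched as follows: latents are trained under Gaussian perturbations so that nearby points decode similarly, which is also what a diffusion forward process requires. The cosine noise schedule and every dimension below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical compressed latents: 8 positions x 32 dims (shapes are illustrative).
latents = rng.normal(size=(8, 32))

def perturb(z, sigma, rng):
    """Perturbation augmentation: add Gaussian noise so that nearby latents
    decode to similar text, encouraging a smooth latent space."""
    return z + sigma * rng.normal(size=z.shape)

# A diffusion-style forward step at noise level t in [0, 1]:
# z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps
t = 0.3
alpha_bar = np.cos(0.5 * np.pi * t) ** 2   # cosine schedule, one common choice
eps = rng.normal(size=latents.shape)
z_t = np.sqrt(alpha_bar) * latents + np.sqrt(1.0 - alpha_bar) * eps

augmented = perturb(latents, sigma=0.1, rng=rng)
print(z_t.shape, augmented.shape)
```

A denoiser trained on `z_t` then generates all latent positions in parallel, which is where the speed advantage over token-by-token decoding comes from.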
Problem

Research questions and friction points this paper is trying to address.

Slow decoding in autoregressive text generation models
High dimensionality of token-level diffusion models
Maintaining global coherence in text generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compressed latent space for text diffusion
Autoencoder with token-level reconstruction
Faster inference with parallel generation
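A back-of-the-envelope comparison shows why diffusion over a compressed latent sequence can beat sequential decoding: autoregressive generation needs one forward pass per token, while a diffusion sampler runs a fixed step budget over the whole compressed sequence at once. The token count and sampler budget below are illustrative assumptions, not measurements from the paper.

```python
# Hypothetical decoding cost comparison (counts of forward passes).
tokens = 512          # output length in tokens (illustrative)
compression = 8       # the paper's 8x sequence compression
denoise_steps = 50    # illustrative diffusion sampler budget, not from the paper

ar_passes = tokens                    # one sequential pass per token
latent_positions = tokens // compression  # 64 latents, denoised in parallel
diff_passes = denoise_steps           # fixed, independent of sequence length

print(ar_passes, diff_passes, ar_passes / diff_passes)
```

Under these assumed numbers the diffusion route uses roughly 10× fewer sequential passes; the paper reports a more conservative, measured speedup of over 2×, which also accounts for per-pass cost differences.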
πŸ”Ž Similar Papers
No similar papers found.
Viacheslav Meshchaninov
HSE University, Constructor University
Egor Chimbulatov
HSE University
Alexander Shabalin
HSE University, Constructor University
Aleksandr Abramov
SaluteDevices
Dmitry Vetrov
Professor of Computer Science, Constructor University
Deep learning · Bayesian inference · Graphical models