TextLDM: Language Modeling with Continuous Latent Diffusion

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work applies a nearly unmodified vision Diffusion Transformer (DiT) architecture to text generation. Because reconstruction fidelity alone does not yield continuous latents that are good enough for generation, the authors use Representation Alignment (REPA) with a frozen pretrained language model to improve the latents learned by a Transformer-based variational autoencoder (VAE). A standard DiT then generates in this latent space using flow matching. Trained from scratch on OpenWebText2, TextLDM substantially outperforms prior diffusion language models and matches GPT-2 under the same settings, a step toward unified multimodal diffusion architectures.
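As a rough illustration of the alignment idea, the sketch below scores how closely per-token latent features match features from a frozen teacher model, using negative mean cosine similarity, and adds that score to a VAE objective. Everything here is an assumption made for illustration: the function names, the choice of cosine similarity, and the weights `kl_weight` and `repa_weight` are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def repa_alignment_loss(projected_latents, teacher_features):
    """Negative mean cosine similarity over token positions.

    projected_latents: per-token VAE features, already projected to the
        teacher's feature width (the projection itself is not shown).
    teacher_features: per-token features from the frozen language model.
    """
    sims = [
        cosine_similarity(z, h)
        for z, h in zip(projected_latents, teacher_features)
    ]
    return -sum(sims) / len(sims)


def vae_objective(reconstruction_loss, kl_loss, alignment_loss,
                  kl_weight=1e-4, repa_weight=0.5):
    """Weighted sum of the three terms; both weights are hypothetical."""
    return reconstruction_loss + kl_weight * kl_loss + repa_weight * alignment_loss
```

The loss is lowest (-1) when every latent feature points in the same direction as its teacher feature, so minimizing it pulls the latent space toward the teacher's representation while the reconstruction term keeps the tokens recoverable.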

📝 Abstract

Diffusion Transformers (DiT) trained with flow matching in a VAE latent space have unified visual generation across images and videos. A natural next step toward a single architecture for both generation (visual synthesis) and understanding (text generation) is to apply this framework to language modeling. We propose TextLDM, which transfers the visual latent diffusion recipe to text generation with minimal architectural modification. A Transformer-based VAE maps discrete tokens to continuous latents, enhanced by Representation Alignment (REPA) with a frozen pretrained language model to produce representations effective for conditional denoising. A standard DiT then performs flow matching in this latent space, identical in architecture to its visual counterpart. The central challenge we address is obtaining high-quality continuous text representations: we find that reconstruction fidelity alone is insufficient, and that aligning latent features with a pretrained language model via REPA is critical for downstream generation quality. Trained from scratch on OpenWebText2, TextLDM substantially outperforms prior diffusion language models and matches GPT-2 under the same settings. Our results establish that the visual DiT recipe transfers effectively to language, taking a concrete step toward unified diffusion architectures for multimodal generation and understanding.
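The flow matching step can be sketched in a few lines. The example below assumes the common linear-interpolation form, in which a noisy latent is a straight-line blend of Gaussian noise and a clean latent and the network regresses the constant velocity between them; the abstract does not state which variant TextLDM uses, so this convention, the Euler sampler, and all function names are assumptions. The denoiser itself (the DiT) is represented only by a `velocity_fn` callable.

```python
import random


def flow_matching_sample(latent, rng):
    """Build one training example (x_t, t, target velocity) from a clean latent."""
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    t = rng.random()  # time in [0, 1): 0 is pure noise, 1 is clean data
    x_t = [(1.0 - t) * n + t * z for n, z in zip(noise, latent)]
    target_velocity = [z - n for n, z in zip(noise, latent)]
    return x_t, t, target_velocity


def velocity_mse(predicted, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_generate(velocity_fn, noise, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In training, `velocity_mse` would be applied to the model's prediction at `(x_t, t)`; at generation time, the resulting latent would still need to be decoded back to tokens by the VAE decoder, which is outside this sketch.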

Problem

Research questions and friction points this paper is trying to address.

language modeling

continuous latent diffusion

text representation

diffusion models

multimodal generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Diffusion

Diffusion Transformer

Representation Alignment