🤖 AI Summary
Discrete diffusion models underperform autoregressive models in text generation, with the gap widening as the number of sampling steps is reduced. To address this, the authors propose the Energy-based Diffusion Language Model (EDLM), a discrete diffusion framework that introduces a residual-form energy-based model operating over full sequences at each diffusion step, correcting the imperfect approximation underlying standard diffusion samplers. Methodologically, EDLM obtains its energy function either by leveraging a pretrained autoregressive model or by finetuning a bidirectional Transformer via noise contrastive estimation (NCE), and pairs it with a parallel importance sampling algorithm for efficient generation. On standard language modeling benchmarks, EDLM consistently outperforms existing discrete diffusion models, approaches the perplexity of autoregressive baselines, and delivers a 1.3× sampling speedup over existing diffusion models with no drop in generation quality, alleviating the quality–efficiency trade-off inherent in iterative discrete generation.
📝 Abstract
Despite remarkable progress in autoregressive language models, alternative generative paradigms beyond left-to-right generation are still being actively explored. Discrete diffusion models, with their capacity for parallel generation, have recently emerged as a promising alternative. Unfortunately, these models still underperform their autoregressive counterparts, with the performance gap widening as the number of sampling steps is reduced. Our analysis reveals that this degradation is a consequence of an imperfect approximation used by diffusion models. In this work, we propose the Energy-based Diffusion Language Model (EDLM), an energy-based model operating at the full-sequence level for each diffusion step, introduced to improve the underlying approximation used by diffusion models. More specifically, we introduce an EBM in a residual form, and show that its parameters can be obtained by leveraging a pretrained autoregressive model or by finetuning a bidirectional transformer via noise contrastive estimation. We also propose an efficient generation algorithm via parallel importance sampling. Comprehensive experiments on language modeling benchmarks show that our model consistently outperforms state-of-the-art diffusion models by a significant margin and approaches autoregressive models' perplexity. We further show that, without any drop in generation quality, our framework offers a 1.3$\times$ sampling speedup over existing diffusion models.
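To make the "parallel importance sampling" idea concrete, here is a minimal self-normalized importance sampling sketch: several candidate sequences are drawn in parallel (standing in for proposals from the diffusion model at one denoising step), each is reweighted by `exp(-E(x))` under a residual energy, and one sequence is resampled according to the normalized weights. The `residual_energy` function below is a hypothetical toy stand-in, not the paper's actual energy model, which would come from a pretrained autoregressive model or an NCE-finetuned bidirectional transformer.

```python
import math
import random

def residual_energy(seq):
    # Hypothetical stand-in for EDLM's residual energy over a full sequence.
    # In the paper this is parameterized by a pretrained AR model or an
    # NCE-finetuned bidirectional transformer; here it is a toy score.
    return 0.01 * sum(seq)

def importance_resample(proposals, energy_fn, rng):
    """Self-normalized importance sampling over parallel proposals.

    proposals: candidate token sequences drawn in parallel from the
    proposal (diffusion) distribution. Each is reweighted by exp(-E(x))
    and one sequence is resampled according to the normalized weights.
    """
    log_w = [-energy_fn(x) for x in proposals]
    m = max(log_w)  # subtract the max log-weight for numerical stability
    weights = [math.exp(lw - m) for lw in log_w]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(proposals, weights=probs, k=1)[0]

rng = random.Random(0)
# 16 toy "sequences" of 8 token ids each, proposed in parallel.
proposals = [[rng.randrange(50) for _ in range(8)] for _ in range(16)]
chosen = importance_resample(proposals, residual_energy, rng)
```

Because all candidates are proposed and scored in one batch, the energy correction adds only a reweighting step on top of parallel generation, which is how the framework can improve quality while still speeding up sampling.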