🤖 AI Summary
This work addresses the inherent trade-offs in generation quality, diversity, and inference efficiency between autoregressive (AR) and diffusion-based sequence generation paradigms. To unify these frameworks, we propose position-specific noise hyperschedules that parameterize both AR and diffusion processes within a single formulation; design a hybrid token-level noising mechanism that dynamically balances absorbing and uniform noising strategies to enable error correction; and introduce KV-cache-compatible attention masking to accelerate parallel decoding. Experiments on standard language modeling benchmarks demonstrate state-of-the-art perplexity, along with significant improvements in the diversity, fidelity, and robustness of generated sequences, while simultaneously reducing inference latency.
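To make the hyperschedule idea concrete, the sketch below builds a per-position "keep-probability" table in which conventional diffusion and left-to-right AR decoding fall out as two settings of the same function. The linear schedule, the step-function AR construction, and all names here are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def hyperschedule(T, L, mode="diffusion"):
    """Return a (T+1, L) array alpha[t, i]: the probability that the token
    at position i is un-noised at step t (t = T fully noised, t = 0 clean).

    NOTE: illustrative sketch only -- the linear/step schedules below are
    assumptions standing in for the paper's actual hyperschedules.
    """
    t = np.linspace(0.0, 1.0, T + 1)[:, None]  # normalized time, one row per step
    i = np.arange(L)[None, :]                  # token positions

    if mode == "diffusion":
        # Conventional diffusion: a single shared schedule for all positions.
        alpha = (1.0 - t) * np.ones((1, L))
    elif mode == "autoregressive":
        # AR as a special case: per-position step functions that unmask
        # tokens strictly left to right as t decreases from T to 0.
        alpha = ((1.0 - t) * L > i).astype(float)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return alpha
```

A hyperschedule between these extremes (e.g., staggered but overlapping ramps per position) would interpolate between fully parallel and fully sequential generation.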
📝 Abstract
We present significant extensions to diffusion-based sequence generation models, blurring the line with autoregressive language models. First, we introduce hyperschedules, which assign distinct noise schedules to individual token positions, generalizing both autoregressive models (e.g., GPT) and conventional diffusion models (e.g., SEDD, MDLM) as special cases. Second, we propose two hybrid token-wise noising processes that interpolate between absorbing and uniform processes, enabling the model to fix past mistakes, and we introduce a novel inference algorithm that leverages this feature in a simplified context inspired by MDLM. To support efficient training and inference, we design attention masks compatible with KV-caching. Our methods achieve state-of-the-art perplexity and generate diverse, high-quality sequences across standard benchmarks, suggesting a promising path for autoregressive diffusion-based sequence generation.
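The hybrid absorbing/uniform noising described above can be sketched as a single token-wise forward step. Everything below (the `MASK` id, the `eps` mixing knob, the independent per-token corruption) is an assumed simplification for illustration, not the paper's exact process:

```python
import random

MASK = -1  # hypothetical id for the absorbing [MASK] token

def hybrid_noise(tokens, t, vocab_size, eps=0.1, rng=random):
    """Corrupt each token independently with probability t.

    A corrupted token is resampled uniformly from the vocabulary with
    probability eps, otherwise absorbed to MASK -- so eps interpolates
    between the pure absorbing (eps = 0) and pure uniform (eps = 1)
    processes. Uniform corruptions leave ordinary tokens in wrong places,
    which is what lets a model trained on this process revisit and fix
    past mistakes instead of only filling in masks.
    """
    out = []
    for tok in tokens:
        if rng.random() < t:                           # this position is noised
            if rng.random() < eps:
                out.append(rng.randrange(vocab_size))  # uniform corruption
            else:
                out.append(MASK)                       # absorbing corruption
        else:
            out.append(tok)                            # keep the clean token
    return out
```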