Esoteric Language Models

📅 2025-06-02
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
While masked diffusion language models (MDMs) enable parallel and controllable generation, they suffer from higher perplexity than autoregressive (AR) models and lack critical inference optimizations such as key-value (KV) caching. Method: We propose Eso-LMs, a novel architecture that smoothly interpolates between the AR and MDM paradigms. It introduces the first KV caching mechanism tailored for masked diffusion, trains jointly with AR and diffusion objectives, and uses an optimized sampling schedule. Contribution/Results: Eso-LMs combine controllable perplexity, parallel generation, and efficient inference within a single framework, setting a new state of the art in perplexity on standard language modeling benchmarks. Inference is up to 65× faster than standard MDMs and 4× faster than prior semi-autoregressive methods, while preserving strong controllability and parallelism.

📝 Abstract
Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Among this family of models, Masked Diffusion Models (MDMs) achieve the strongest performance but still underperform AR models in perplexity and lack key inference-time efficiency features--most notably, KV caching. In this work, we introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, enabling smooth interpolation between their perplexities while overcoming their respective limitations. Eso-LMs set a new state of the art on standard language modeling benchmarks. Crucially, we are the **first to introduce KV caching for MDMs** while preserving parallel generation, significantly improving inference efficiency. Combined with an optimized sampling schedule, our method achieves up to **65x** faster inference than standard MDMs and **4x** faster inference than prior semi-autoregressive approaches. We provide the code and model checkpoints on the project page: [http://s-sahoo.github.io/Eso-LMs](http://s-sahoo.github.io/Eso-LMs)
Problem

Research questions and friction points this paper is trying to address.

Improving perplexity and efficiency of diffusion-based language models
Enabling KV caching for Masked Diffusion Models without losing parallelism
Achieving faster inference than standard MDMs and semi-autoregressive models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses AR and MDM paradigms for better performance
Introduces KV caching for MDMs, enhancing efficiency
Optimizes sampling schedule for faster inference speed
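The paper does not spell out its caching mechanism here, but the core idea behind KV caching in general can be illustrated with a toy sketch: once a token is finalized (e.g. denoised), its key/value vectors are computed once and reused at every later step instead of being recomputed. The following is a minimal NumPy sketch under that assumption; all names, shapes, and the sequential-finalization loop are hypothetical and are not taken from Eso-LMs.

```python
import numpy as np

# Toy sketch of KV caching (hypothetical, not the Eso-LMs mechanism):
# keys/values for finalized tokens are computed once and appended to a
# cache, so later attention steps reuse them instead of recomputing.

rng = np.random.default_rng(0)
d = 8  # hidden size (arbitrary for this sketch)

def attend(q, K, V):
    """Single-query scaled dot-product attention over cached K/V."""
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# Simulated sequential finalization: one token is fixed per step.
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))
k_cache, v_cache = [], []

outputs = []
for step in range(4):
    h = rng.normal(size=d)     # hidden state of the token finalized this step
    k_cache.append(Wk @ h)     # compute its key/value once, then cache
    v_cache.append(Wv @ h)
    q = rng.normal(size=d)     # query for the next prediction
    outputs.append(attend(q, np.array(k_cache), np.array(v_cache)))

print(len(k_cache))  # 4 cached keys; none were recomputed
```

The point of the sketch is the asymptotics: with a cache, each step does O(n) attention work against stored keys/values rather than recomputing all n projections, which is the efficiency MDMs normally forfeit because masked tokens keep changing.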
🔎 Similar Papers