TABES: Trajectory-Aware Backward-on-Entropy Steering for Masked Diffusion Models

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses trajectory lock-in and global inconsistency in non-autoregressive generation with masked diffusion models, which stem from local unmasking decisions made without long-term planning. To this end, the authors propose the Backward-on-Entropy (BoE) Steering framework, which approximates infinite-horizon foresight with a single backward pass: the gradient of future entropy with respect to the input embeddings serves as a steering signal for optimizing the generation trajectory. The key innovations are the derivation of a Token Influence Score from a first-order expansion of a trajectory cost functional, and an ActiveQueryAttention sparse adjoint operator that makes this mathematically grounded uncertainty minimization computationally efficient. Experiments show that the method outperforms existing unmasking approaches in both inference efficiency and generation quality, achieving Pareto-optimal performance.
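To make the core mechanism concrete, here is a minimal sketch of a Token-Influence-Score-style computation. The paper's TIS is defined through the full MDM denoiser; this sketch substitutes a toy position-wise linear readout (`W`) so the entropy gradient can be written analytically, and all names (`token_influence_scores`, `E`, `W`) are illustrative assumptions, not the authors' API. The idea shown is the one the summary describes: differentiate the total predictive entropy with respect to the input embeddings in one backward pass, and score each token by the norm of its gradient row.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_influence_scores(E, W):
    """Sketch of a TIS-like score: norm of d(total entropy)/d(embedding).

    E: (T, d) token embeddings; W: (d, V) toy readout to vocabulary logits.
    The real model is a full MDM denoiser; the linear readout is a stand-in
    so the backward pass can be written in closed form.
    """
    Z = E @ W                        # (T, V) logits
    P = softmax(Z)
    logP = np.log(P + 1e-12)
    H = -(P * logP).sum(axis=-1)     # (T,) per-position predictive entropy
    # Analytic gradient of sum(H) w.r.t. logits: dH/dz_j = -p_j (log p_j + H)
    dH_dZ = -P * (logP + H[:, None])
    dH_dE = dH_dZ @ W.T              # (T, d) "backward pass" to embeddings
    return np.linalg.norm(dH_dE, axis=-1), H
```

In this toy, a steering step would unmask (or nudge) the position with the largest score, i.e. the token whose embedding most strongly controls remaining uncertainty; the real method uses one autodiff backward pass through the denoiser instead of the closed-form gradient.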

📝 Abstract
Masked Diffusion Models (MDMs) have emerged as a promising non-autoregressive paradigm for generative tasks, offering parallel decoding and bidirectional context utilization. However, current sampling methods rely on simple confidence-based heuristics that ignore the long-term impact of local decisions, leading to trajectory lock-in where early hallucinations cascade into global incoherence. While search-based methods mitigate this, they incur prohibitive computational costs ($O(K)$ forward passes per step). In this work, we propose Backward-on-Entropy (BoE) Steering, a gradient-guided inference framework that approximates infinite-horizon lookahead via a single backward pass. We formally derive the Token Influence Score (TIS) from a first-order expansion of the trajectory cost functional, proving that the gradient of future entropy with respect to input embeddings serves as an optimal control signal for minimizing uncertainty. To ensure scalability, we introduce \texttt{ActiveQueryAttention}, a sparse adjoint primitive that exploits the structure of the masking objective to reduce backward pass complexity. BoE achieves a superior Pareto frontier for inference-time scaling compared to existing unmasking methods, demonstrating that gradient-guided steering offers a mathematically principled and efficient path to robust non-autoregressive generation. We will release the code.
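The abstract's \texttt{ActiveQueryAttention} primitive exploits the fact that only masked positions enter the entropy objective, so adjoint (gradient) rows for already-unmasked positions are zero and need not be computed. The sketch below illustrates that structural saving under a strong simplifying assumption: a position-wise readout, where each position's entropy depends only on its own embedding, so restricting the backward pass to masked rows is exact. In a full attention model the adjoint mixes across positions and the actual primitive is more involved; `masked_entropy_grad` and its arguments are hypothetical names, not the paper's interface.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_entropy_grad(E, W, masked):
    """Backward pass restricted to masked positions (sparse-adjoint sketch).

    E: (T, d) embeddings; W: (d, V) toy readout; masked: boolean (T,).
    Only masked positions contribute to the future-entropy objective, so
    gradient rows for unmasked positions are exactly zero and are skipped.
    """
    Em = E[masked]                   # active queries only
    Z = Em @ W
    P = softmax(Z)
    logP = np.log(P + 1e-12)
    H = -(P * logP).sum(axis=-1)
    dH_dE = np.zeros_like(E)         # unmasked rows stay zero, never computed
    dH_dE[masked] = (-P * (logP + H[:, None])) @ W.T
    return dH_dE
```

With M masked positions out of T, the backward work here scales with M rather than T, which is the kind of complexity reduction the abstract attributes to the sparse adjoint.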
Problem

Research questions and friction points this paper is trying to address.

Masked Diffusion Models · trajectory lock-in · non-autoregressive generation · sampling heuristics · global incoherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Backward-on-Entropy · Masked Diffusion Models · Token Influence Score · non-autoregressive generation · ActiveQueryAttention