Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses a limitation of existing controlled generation methods for discrete diffusion language models: they typically apply a uniform intervention at every denoising step, which degrades output quality, especially under multi-attribute control. The study finds that different semantic attributes exhibit distinct temporal commitment profiles during the denoising process. Building on this mechanistic insight, the authors propose an adaptive intervention scheduler that applies control only while each attribute is actively forming. By analyzing the internal representations of four models (ranging from 124M to 8B parameters) with sparse autoencoders, they quantify attribute formation timelines and derive a closed-form characterization of the cost-control trade-off in intervention scheduling. Experiments across four models and seven steering tasks show improvements over baselines: under simultaneous three-attribute control, the method reaches up to 93% steering strength, up to 15 percentage points above the strongest baseline, while preserving generation quality.

📝 Abstract

Discrete diffusion language models (DLMs) generate text by iteratively denoising all positions in parallel, offering an alternative to autoregressive models. Controlled generation methods for DLMs, imported from autoregressive models, apply uniform intervention at every denoising step. We show this uniform schedule degrades quality, and the damage compounds when multiple attributes are steered jointly. To diagnose the failure, we train sparse autoencoders on four DLMs (124M-8B parameters) and find that different attributes commit on distinct schedules, varying in timing, sharpness, and magnitude. For instance, topic commits within the first 2% of denoising, whereas sentiment emerges gradually over 20% of the process. Consequently, uniform intervention wastes steering capacity on steps where the target attribute has already solidified or has yet to emerge. We propose a novel adaptive scheduler that concentrates interventions on the steps where an attribute is actively forming and leaves the rest of generation untouched. The cost-control trade-off admits a closed-form characterization: the advantage of adaptive over uniform scheduling is governed by a single dispersion statistic of the commitment distribution. Across four DLMs and seven steering tasks, our method achieves precise control without the degradation typical of uniform interventions. In particular, on challenging simultaneous three-attribute control, it reaches up to 93% steering strength, beating the strongest baseline by up to 15 percentage points while preserving generation quality.
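
The contrast between uniform and adaptive scheduling can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the scheduler's formula or the exact dispersion statistic, so everything here is an assumption made for illustration. The function names (`commitment_mass`, `adaptive_schedule`, `uniform_schedule`, `effective_steps`), the linear toy commitment curves, the proportional allocation rule, and the use of an inverse-Simpson "effective number of active steps" as a stand-in dispersion measure are all hypothetical. Only the 2% and 20% commitment windows come from the abstract.

```python
import numpy as np

def commitment_mass(commit_curve):
    """Turn a cumulative commitment curve (values in [0, 1], one per
    denoising step) into a per-step distribution that sums to 1."""
    curve = np.asarray(commit_curve, dtype=float)
    mass = np.clip(np.diff(curve, prepend=0.0), 0.0, None)
    return mass / mass.sum()

def uniform_schedule(num_steps, budget):
    """Baseline: the same steering strength at every denoising step."""
    return np.full(num_steps, budget / num_steps)

def adaptive_schedule(commit_curve, budget):
    """Assumed rule: spend the steering budget in proportion to how much
    of the attribute commits at each step (zero where nothing commits)."""
    return budget * commitment_mass(commit_curve)

def effective_steps(commit_curve):
    """Stand-in dispersion measure (inverse Simpson index): roughly how
    many steps carry the commitment. Not the paper's statistic."""
    mass = commitment_mass(commit_curve)
    return 1.0 / np.sum(mass ** 2)

# Toy linear commitment curves over 100 denoising steps (hypothetical shapes).
num_steps = 100
t = np.arange(1, num_steps + 1) / num_steps
topic_curve = np.clip(t / 0.02, 0.0, 1.0)      # commits within the first 2%
sentiment_curve = np.clip(t / 0.20, 0.0, 1.0)  # emerges over the first 20%

topic_sched = adaptive_schedule(topic_curve, budget=1.0)
sentiment_sched = adaptive_schedule(sentiment_curve, budget=1.0)
baseline = uniform_schedule(num_steps, budget=1.0)

# Fraction of the uniform budget spent on steps where nothing is committing.
wasted_topic = baseline[commitment_mass(topic_curve) == 0].sum()
wasted_sentiment = baseline[commitment_mass(sentiment_curve) == 0].sum()
```

In this toy setting the adaptive schedule for topic is nonzero on only 2 of 100 steps and the one for sentiment on 20, while the uniform baseline spends most of its budget on steps where the attribute is no longer forming. The more concentrated the commitment distribution, the larger that gap, which is the intuition behind tying the advantage of adaptive scheduling to a dispersion statistic.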

Problem

Research questions and friction points this paper is trying to address.

discrete diffusion language models

controlled generation

uniform intervention

attribute steering

generation quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete diffusion language models

adaptive intervention scheduling

attribute commitment dynamics