🤖 AI Summary
To address low attribute control accuracy, high computational overhead, and poor generalization in controllable text generation, this paper proposes RegDiff—a novel framework that explicitly incorporates attribute regularization into the training phase of diffusion models. Unlike classifier-guided sampling, RegDiff jointly optimizes text reconstruction and attribute supervision in the latent space, eliminating reliance on external classifiers during inference. Built upon a VAE-based encoder-decoder architecture, it ensures semantic fidelity while enabling fine-grained attribute alignment within the latent diffusion process. Extensive experiments across five multi-attribute benchmark datasets demonstrate that RegDiff significantly outperforms strong baselines: it improves attribute control accuracy by 12.3%, enhances language quality (BLEU +4.1, BERTScore +2.8), and reduces sampling latency by 37%. These results validate RegDiff’s effectiveness, efficiency, and generalization capability.
📝 Abstract
Generating stylistic text with specific attributes is a key problem in controllable text generation. Recently, diffusion models have emerged as a powerful paradigm for both visual and textual generation. Existing approaches can be broadly categorized into classifier-free guidance (CFG) and classifier guidance (CG) methods. While CFG effectively preserves semantic content, it often fails to provide effective attribute control. In contrast, CG modifies the denoising trajectory using classifier gradients, enabling better attribute alignment but incurring high computational costs during sampling and suffering from classifier generalization issues. In this work, we propose RegDiff, a regularized diffusion framework that leverages attribute features without requiring a pretrained classifier during sampling, thereby achieving controllable generation with reduced computational costs. Specifically, RegDiff employs a VAE-based encoder--decoder architecture to ensure reconstruction fidelity and a latent diffusion model trained with attribute supervision to enable controllable text generation. Attribute information is injected only during training. Experiments on five datasets spanning multiple stylistic attributes demonstrate that RegDiff outperforms strong baselines in generating stylistic texts. These results validate the effectiveness of RegDiff as an efficient solution for attribute-controllable text diffusion. Our code, datasets, and resources will be released upon publication at https://github.com/xxxx.