🤖 AI Summary
To address the suboptimality and poor generalization of handcrafted negative prompt embeddings in text-to-image generation, this paper proposes an end-to-end learnable, reward-guided negative embedding framework. Methodologically, it is the first to extend classifier-free guidance (CFG) from inference to training, jointly optimizing both global and sample-specific negative embeddings. The framework integrates CLIP text encoder fine-tuning, diffusion model adaptation, and reward modeling to align generated outputs with human preferences. Key contributions include: (1) introducing the first reward-driven paradigm for learning negative embeddings; (2) enabling the first systematic integration of CFG into the training phase; and (3) achieving seamless cross-model transfer (e.g., from SD1.5 to ControlNet, ZeroScope, and VideoCrafter2) and cross-task generalization (text-to-image and text-to-video), significantly improving generation quality and preference alignment.
📝 Abstract
In text-to-image (T2I) generation applications, negative embeddings have proven to be a simple yet effective approach for enhancing generation quality. Typically, these negative embeddings are derived from user-defined negative prompts, which, while functional, are not necessarily optimal. In this paper, we introduce ReNeg, an end-to-end method designed to learn improved Negative embeddings guided by a Reward model. We employ a reward feedback learning framework and integrate classifier-free guidance (CFG) into the training process, whereas it was previously utilized only during inference, thus enabling the effective learning of negative embeddings. We also propose two strategies for learning both global and per-sample negative embeddings. Extensive experiments show that the learned negative embedding significantly outperforms null-text and handcrafted counterparts, achieving substantial improvements in human preference alignment. Additionally, the negative embedding learned within the same text embedding space exhibits strong generalization capabilities. For example, using the same CLIP text encoder, the negative embedding learned on SD1.5 can be seamlessly transferred to text-to-image or even text-to-video models such as ControlNet, ZeroScope, and VideoCrafter2, resulting in consistent performance improvements across the board.
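The core idea above, replacing the null-text embedding in standard CFG with a negative embedding optimized against a reward signal, can be sketched in a toy form. The snippet below is a minimal illustration, not the paper's implementation: `denoiser` and `toy_reward` are hypothetical stand-ins for the diffusion noise predictor and the human-preference reward model, and the gradient is taken by finite differences for simplicity.

```python
import numpy as np

def denoiser(x, emb):
    # Hypothetical stand-in for a conditional noise predictor.
    return x - emb

def cfg_predict(x, pos_emb, neg_emb, w=7.5):
    # Standard CFG combination, with the learnable negative embedding
    # taking the place of the null-text (empty prompt) embedding.
    e_neg = denoiser(x, neg_emb)
    e_pos = denoiser(x, pos_emb)
    return e_neg + w * (e_pos - e_neg)

def toy_reward(pred, target):
    # Stand-in for a reward model scoring the guided prediction.
    return -np.sum((pred - target) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # toy noisy latent
pos = rng.normal(size=4)      # positive (prompt) embedding, kept fixed
target = rng.normal(size=4)   # direction the reward prefers
neg = np.zeros(4)             # initialize from the null-text embedding

# Gradient ascent on the reward w.r.t. the negative embedding only,
# i.e. CFG is applied inside the training loop rather than at inference.
lr, h = 1e-2, 1e-5
for _ in range(200):
    grad = np.zeros_like(neg)
    for i in range(len(neg)):
        d = np.zeros_like(neg)
        d[i] = h
        r_plus = toy_reward(cfg_predict(x, pos, neg + d), target)
        r_minus = toy_reward(cfg_predict(x, pos, neg - d), target)
        grad[i] = (r_plus - r_minus) / (2 * h)
    neg += lr * grad

print("reward with learned negative embedding:",
      toy_reward(cfg_predict(x, pos, neg), target))
```

In this toy setting the learned `neg` ends up with a strictly higher reward than the null-text initialization, mirroring the paper's claim that reward-guided negative embeddings outperform null-text baselines; a global embedding corresponds to sharing `neg` across prompts, while the per-sample variant would predict a distinct `neg` for each prompt.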