ReNeg: Learning Negative Embedding with Reward Guidance

📅 2024-12-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the suboptimality and poor generalization of handcrafted negative prompt embeddings in text-to-image generation, this paper proposes an end-to-end learnable, reward-guided negative embedding framework. Methodologically, it is the first to extend classifier-free guidance (CFG) from inference to training, jointly optimizing both global and sample-specific negative embeddings. The framework integrates CLIP text encoder fine-tuning, diffusion model adaptation, and reward modeling to align generated outputs with human preferences. Key contributions include: (1) introducing the first reward-driven paradigm for learning negative embeddings; (2) enabling the first systematic integration of CFG into the training phase; and (3) achieving seamless cross-model transfer (e.g., from SD1.5 to ControlNet, ZeroScope, and VideoCrafter2) and cross-task generalization (text-to-image and text-to-video), significantly improving generation quality and preference alignment.

📝 Abstract
In text-to-image (T2I) generation applications, negative embeddings have proven to be a simple yet effective approach for enhancing generation quality. Typically, these negative embeddings are derived from user-defined negative prompts, which, while functional, are not necessarily optimal. In this paper, we introduce ReNeg, an end-to-end method designed to learn improved Negative embeddings guided by a Reward model. We employ a reward feedback learning framework and integrate classifier-free guidance (CFG) into the training process, which was previously utilized only during inference, thus enabling the effective learning of negative embeddings. We also propose two strategies for learning both global and per-sample negative embeddings. Extensive experiments show that the learned negative embedding significantly outperforms null-text and handcrafted counterparts, achieving substantial improvements in human preference alignment. Additionally, the negative embedding learned within the same text embedding space exhibits strong generalization capabilities. For example, using the same CLIP text encoder, the negative embedding learned on SD1.5 can be seamlessly transferred to text-to-image or even text-to-video models such as ControlNet, ZeroScope, and VideoCrafter2, resulting in consistent performance improvements across the board.
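The core idea in the abstract can be sketched with a toy example: apply the CFG combination `eps = eps(neg) + s * (eps(pos) - eps(neg))` at training time, and optimize only the negative embedding against a reward on the guided output. This is a minimal stand-in, not the paper's implementation: the "denoiser" is a fixed linear map, the reward is a hand-picked differentiable score, and the gradient is estimated by finite differences rather than backpropagation through a diffusion sampler. All names (`cfg_eps`, `reward`, `W`, `c_pos`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 4))    # toy "denoiser" weights (frozen)
c_pos = rng.normal(size=4)           # positive (prompt) embedding (frozen)
neg = np.zeros(4)                    # learnable negative embedding (init: null text)
s = 7.5                              # CFG guidance scale

def cfg_eps(neg_emb):
    """Classifier-free guidance applied at *training* time:
    eps = eps(neg) + s * (eps(pos) - eps(neg))."""
    eps_pos = W @ c_pos
    eps_neg = W @ neg_emb
    return eps_neg + s * (eps_pos - eps_neg)

# Toy reward: prefer guided outputs close to a fixed target direction.
target = np.ones(4)
def reward(eps):
    return -np.sum((eps - target) ** 2)

# Gradient ascent on the reward w.r.t. the negative embedding only,
# via central finite differences (a stand-in for backprop through the sampler).
h, lr = 1e-4, 0.01
for _ in range(300):
    grad = np.zeros_like(neg)
    for i in range(len(neg)):
        e = np.zeros_like(neg)
        e[i] = h
        grad[i] = (reward(cfg_eps(neg + e)) - reward(cfg_eps(neg - e))) / (2 * h)
    neg += lr * grad                 # ascend the reward

print(reward(cfg_eps(neg)) > reward(cfg_eps(np.zeros(4))))  # learned > null-text init
```

The key point mirrored from the paper is that the guided prediction, not the raw conditional one, receives the reward signal, so the gradient flows into the negative branch of CFG and the embedding learns what to steer *away* from.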
Problem

Research questions and friction points this paper is trying to address.

Text-to-Image Generation
Negative Information Optimization
User Preference Adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

ReNeg
Negative Information Representation
Image and Video Generation
Xiaomin Li
Advanced Micro Devices Inc., Dalian University of Technology
Yixuan Liu
AMD, Tsinghua University
Generative AI
Takashi Isobe
Tsinghua University
Computer Vision, Machine Learning
Xu Jia
Associate Professor at Dalian University of Technology
Computer Vision, Machine Learning, Bio-Inspired Vision
Qinpeng Cui
Advanced Micro Devices Inc., Tsinghua University
Dong Zhou
Advanced Micro Devices Inc.
Dong Li
Advanced Micro Devices Inc.
You He
Tsinghua University
Huchuan Lu
Dalian University of Technology
Zhongdao Wang
Noah's Ark Lab, Huawei
Computer Vision, Autonomous Driving
E. Barsoum
Advanced Micro Devices Inc.