Dual-Process Image Generation

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Feed-forward image generators are difficult to adapt to novel control tasks, whereas vision-language models (VLMs) can learn new tasks in-context. This paper proposes a dual-process distillation scheme in which a deliberative VLM rates generated images and the resulting gradient is backpropagated to update the image generator's weights. A single text-and-image interface supports a wide variety of control signals, including commonsense inferences and visual prompts; users can implement multimodal controls over properties such as color palette, line weight, horizon position, and relative depth within minutes.

📝 Abstract
Prior methods for controlling image generation are limited in their ability to be taught new tasks. In contrast, vision-language models, or VLMs, can learn tasks in-context and produce the correct outputs for a given input. We propose a dual-process distillation scheme that allows feed-forward image generators to learn new tasks from deliberative VLMs. Our scheme uses a VLM to rate the generated images and backpropagates this gradient to update the weights of the image generator. Our general framework enables a wide variety of new control tasks through the same text-and-image based interface. We showcase a handful of applications of this technique for different types of control signals, such as commonsense inferences and visual prompts. With our method, users can implement multimodal controls for properties such as color palette, line weight, horizon position, and relative depth within a matter of minutes. Project page: https://dual-process.github.io.
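The control loop described in the abstract, generate, have a VLM score the result, backpropagate that score into the generator's weights, can be sketched minimally. The snippet below is a toy stand-in, not the paper's implementation: the "generator" and "VLM critic" are invented placeholder functions, and gradients are estimated by finite differences rather than true backpropagation, so only the control flow matches the described scheme.

```python
import random

def generate(weights):
    """Toy 'generator': weights directly parameterize pixel values in [0, 1]."""
    return [max(0.0, min(1.0, w)) for w in weights]

def vlm_score(image, target_brightness=0.8):
    """Toy 'VLM critic': rates how well the image satisfies an instruction
    such as 'make it brighter' (higher score is better)."""
    mean = sum(image) / len(image)
    return -((mean - target_brightness) ** 2)

def distill_step(weights, lr=0.5, eps=1e-4):
    """One distillation update: estimate d(score)/d(weights) by finite
    differences (a stand-in for backprop) and take a gradient-ascent step."""
    grads = []
    base = vlm_score(generate(weights))
    for i in range(len(weights)):
        bumped = list(weights)
        bumped[i] += eps
        grads.append((vlm_score(generate(bumped)) - base) / eps)
    return [w + lr * g for w, g in zip(weights, grads)]

random.seed(0)
weights = [random.random() * 0.3 for _ in range(4)]  # start with a dark image
before = vlm_score(generate(weights))
for _ in range(50):
    weights = distill_step(weights)
after = vlm_score(generate(weights))
```

After the loop, the critic's score has improved and the generated "image" has been steered toward the instructed brightness, mirroring how the paper uses VLM feedback gradients to teach the generator a new control task in minutes.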
Problem

Research questions and friction points this paper is trying to address.

Enabling feed-forward image generators to learn new tasks from VLMs
Providing multimodal controls for diverse image properties
Overcoming limitations of prior methods in task adaptability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-process distillation for image generation
Backpropagation of VLM ratings to update generator weights
Multimodal controls via text-image interface