🤖 AI Summary
Existing feed-forward image generators lack task adaptability and generalize poorly to novel control tasks. This paper proposes a dual-process distillation scheme in which a deliberative vision-language model (VLM) teaches a fast feed-forward image generator new tasks: the VLM rates each generated image, and the resulting gradient is backpropagated to update the generator's weights. Because control is specified through the VLM's own text-and-image interface, the same framework supports a wide variety of control signals, from commonsense inferences to visual prompts. Key contributions include: (i) a VLM-driven, task-aware distillation mechanism that sidesteps conventional per-task fine-tuning; and (ii) a general multimodal (text + image) interface for defining new controls. With this method, users can implement controls over properties such as color palette, line weight, horizon position, and relative depth within a matter of minutes, improving both controllability and generalization.
📝 Abstract
Prior methods for controlling image generation are limited in their ability to be taught new tasks. In contrast, vision-language models, or VLMs, can learn tasks in-context and produce the correct outputs for a given input. We propose a dual-process distillation scheme that allows feed-forward image generators to learn new tasks from deliberative VLMs. Our scheme uses a VLM to rate the generated images and backpropagates this gradient to update the weights of the image generator. Our general framework enables a wide variety of new control tasks through the same text-and-image based interface. We showcase a handful of applications of this technique for different types of control signals, such as commonsense inferences and visual prompts. With our method, users can implement multimodal controls for properties such as color palette, line weight, horizon position, and relative depth within a matter of minutes. Project page: https://dual-process.github.io.
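The rate-and-backpropagate loop described above can be sketched in a few lines of PyTorch. The sketch below uses toy stand-ins: `ToyGenerator` and `ToyVLMScorer` are illustrative placeholders I introduce here, not the paper's actual models; the only assumption that matters is that the scorer is differentiable with respect to the image, so the VLM's rating can flow back into the generator's weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyGenerator(nn.Module):
    """Placeholder for a feed-forward image generator (maps noise to an image)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                                 nn.Linear(64, 3 * 8 * 8))

    def forward(self, z):
        return torch.sigmoid(self.net(z)).view(-1, 3, 8, 8)

class ToyVLMScorer(nn.Module):
    """Placeholder for a frozen VLM that rates how well an image satisfies
    an instruction; it must be differentiable w.r.t. the image pixels."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(3 * 8 * 8, 1)

    def forward(self, images):
        return self.head(images.flatten(1)).squeeze(-1)

generator, scorer = ToyGenerator(), ToyVLMScorer()
for p in scorer.parameters():          # the VLM critic stays frozen
    p.requires_grad_(False)
opt = torch.optim.Adam(generator.parameters(), lr=1e-2)

with torch.no_grad():
    score_before = scorer(generator(torch.randn(64, 16))).mean().item()

for step in range(50):
    z = torch.randn(4, 16)
    images = generator(z)
    # The VLM rates the images; maximizing its score = minimizing the negative.
    loss = -scorer(images).mean()
    opt.zero_grad()
    loss.backward()                    # gradient flows through the frozen scorer
    opt.step()                         # only the generator's weights are updated

with torch.no_grad():
    score_after = scorer(generator(torch.randn(64, 16))).mean().item()
```

After a few dozen updates the generator's outputs should earn higher ratings from the scorer (`score_after > score_before`), which is the essence of the distillation loop; in the paper the critic is a pretrained VLM queried with the user's multimodal task specification rather than a toy linear head.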