🤖 AI Summary
Existing image generation methods suffer from task-specific design limitations or bottlenecks in instruction generalization, task-distribution modeling, and architectural unification within universal models. To address these challenges, we propose the first multi-task-unified, vision-driven image generation framework: (1) it eliminates reliance on language instructions and introduces a novel visual context learning paradigm for task identification; (2) it constructs Graph200K—a graph-structured dataset—to enhance task density and cross-task knowledge transfer; and (3) it theoretically establishes the consistency between image completion and general-purpose generation, enabling effective reuse of pre-trained completion priors. The framework supports in-domain execution, zero-shot transfer, joint multi-task inference, and inverse generation. Evaluated across 12 diverse tasks—including editing, inpainting, and synthesis—it achieves state-of-the-art generalization performance, improves zero-shot accuracy by 27.3%, and attains an average 3.8× speedup in inference over task-specific models.
📝 Abstract
Recent progress in diffusion models significantly advances various image generation tasks. However, the current mainstream approach remains focused on building task-specific models, which have limited efficiency when supporting a wide range of different needs. While universal models attempt to address this limitation, they face critical challenges, including generalizable task instruction, appropriate task distributions, and unified architectural design. To tackle these challenges, we propose VisualCloze, a universal image generation framework, which supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation. Unlike existing methods that rely on language-based task instruction, leading to task ambiguity and weak generalization, we integrate visual in-context learning, allowing models to identify tasks from visual demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge. Furthermore, we uncover that our unified image generation formulation shared a consistent objective with image infilling, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying the architectures.