Condition Weaving Meets Expert Modulation: Towards Universal and Controllable Image Generation

📅 2025-08-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing image generation methods employ separate control branches for distinct conditioning modalities (e.g., edges, depth, text), leading to model redundancy, computational inefficiency, and difficulty in joint multimodal modeling. To address these limitations, we propose UniGen, a unified generative framework. Its key contributions are: (1) a Conditional Modulation Mixture-of-Experts (CoMoE) module that aggregates features based on semantic similarity and dynamically routes them to specialized experts, mitigating cross-condition feature entanglement; (2) WeaveNet, a dynamic serpentine connection mechanism enabling efficient fine-grained spatial control and global text guidance fusion; and (3) a multimodal feature alignment and conditional disentanglement strategy. Evaluated on Subjects-200K and MultiGen-20M, UniGen achieves state-of-the-art performance across multiple tasks, significantly improving both generation quality and conditioning fidelity. The code is publicly available.

πŸ“ Abstract
The image-to-image generation task aims to produce controllable images by leveraging conditional inputs and prompt instructions. However, existing methods often train separate control branches for each type of condition, leading to redundant model structures and inefficient use of computational resources. To address this, we propose a Unified image-to-image Generation (UniGen) framework that supports diverse conditional inputs while enhancing generation efficiency and expressiveness. Specifically, to tackle the widely existing parameter redundancy and computational inefficiency in controllable conditional generation architectures, we propose the Condition Modulated Expert (CoMoE) module. This module aggregates semantically similar patch features and assigns them to dedicated expert modules for visual representation and conditional modeling. By enabling independent modeling of foreground features under different conditions, CoMoE effectively mitigates feature entanglement and redundant computation in multi-condition scenarios. Furthermore, to bridge the information gap between the backbone and control branches, we propose WeaveNet, a dynamic, snake-like connection mechanism that enables effective interaction between global text-level control from the backbone and fine-grained control from conditional branches. Extensive experiments on the Subjects-200K and MultiGen-20M datasets across various conditional image generation tasks demonstrate that our method consistently achieves state-of-the-art performance, validating its advantages in both versatility and effectiveness. The code has been uploaded to https://github.com/gavin-gqzhang/UniGen.
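The abstract's core routing idea, assigning each patch feature to a dedicated expert based on semantic similarity, can be illustrated with a minimal sketch. This is a hypothetical, dependency-free toy in the spirit of CoMoE, not the paper's implementation: the function names (`cosine`, `route_patches`), the use of cosine similarity, and the per-expert prototype vectors are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def route_patches(patches, prototypes):
    """Group patch features by their most similar expert prototype.

    patches    : list of feature vectors, one per image patch
    prototypes : list of expert prototype vectors, one per expert
    Returns a dict mapping expert index -> list of patch indices,
    so semantically similar patches end up sharing an expert.
    """
    groups = {i: [] for i in range(len(prototypes))}
    for p_idx, feat in enumerate(patches):
        best = max(range(len(prototypes)),
                   key=lambda e: cosine(feat, prototypes[e]))
        groups[best].append(p_idx)
    return groups

# Toy example: two "experts" whose prototypes point along x and y.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
patches = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.05], [0.1, 1.0]]
print(route_patches(patches, prototypes))  # {0: [0, 2], 1: [1, 3]}
```

In the paper's setting the prototypes would be learned and each expert would carry out visual representation and conditional modeling for its group; the sketch only shows the routing step that keeps different conditions' features from entangling.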
Problem

Research questions and friction points this paper is trying to address.

Redundant model structures from separate condition branches
Inefficient computational resource usage in multi-condition generation
Feature entanglement between global and conditional controls
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for diverse conditional image generation
Condition Modulated Expert module reduces feature redundancy
WeaveNet enables dynamic interaction between control branches
Guoqing Zhang
Beijing Jiaotong University
Xingtong Ge
Hong Kong University of Science and Technology, SenseTime, Beijing Institute of Technology
Diffusion models · Image/Video Compression · Gaussian Splatting
Lu Shi
Postdoc, Tsinghua University
Robotics · Control · Data-Driven · Koopman Operator
Xin Zhang
SenseTime Research
Muqing Xue
Beijing Jiaotong University
Wanru Xu
Beijing Jiaotong University
Yigang Cen
Beijing Jiaotong University