π€ AI Summary
Existing image generation methods suffer from feature entanglement when jointly conditioned on subject, style, and structure, resulting in poor cross-task transferability and limited fine-grained controllability. To address this, we propose the first unified triple-conditioned (subject/style/structure) generation framework. Our method introduces an Adaptive Task-specific Memory (ATM) module that dynamically disentangles and retrieves identity-, texture-, and layout-related priors; establishes 3SGen-Benchβa standardized benchmark for triple-conditioned generation evaluation; and integrates multimodal large language model (MLLM)-driven semantic understanding, learnable queries, VAE-based latent modeling, and a lightweight gated ATM mechanism. Extensive experiments on 3SGen-Bench and multiple public benchmarks demonstrate significant improvements in cross-task fidelity and fine-grained controllability. The framework enables robust composition of complex conditional specifications without task interference, establishing a new paradigm for multi-condition collaborative image generation.
π Abstract
Recent image generation approaches often address subject, style, and structure-driven conditioning in isolation, leading to feature entanglement and limited task transferability. In this paper, we introduce 3SGen, a task-aware unified framework that performs all three conditioning modes within a single model. 3SGen employs an MLLM equipped with learnable semantic queries to align text-image semantics, complemented by a VAE branch that preserves fine-grained visual details. At its core, an Adaptive Task-specific Memory (ATM) module dynamically disentangles, stores, and retrieves condition-specific priors, such as identity for subjects, textures for styles, and spatial layouts for structures, via a lightweight gating mechanism along with several scalable memory items. This design mitigates inter-task interference and naturally scales to compositional inputs. In addition, we propose 3SGen-Bench, a unified image-driven generation benchmark with standardized metrics for evaluating cross-task fidelity and controllability. Extensive experiments on our proposed 3SGen-Bench and other public benchmarks demonstrate our superior performance across diverse image-driven generation tasks.