3SGen: Unified Subject, Style, and Structure-Driven Image Generation with Adaptive Task-specific Memory

📅 2025-12-22
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing image generation methods suffer from feature entanglement when jointly conditioned on subject, style, and structure, resulting in poor cross-task transferability and limited fine-grained controllability. To address this, the paper proposes the first unified triple-conditioned (subject/style/structure) generation framework. The method introduces an Adaptive Task-specific Memory (ATM) module that dynamically disentangles and retrieves identity-, texture-, and layout-related priors; establishes 3SGen-Bench, a standardized benchmark for evaluating triple-conditioned generation; and integrates multimodal large language model (MLLM)-driven semantic understanding, learnable queries, VAE-based latent modeling, and a lightweight gated ATM mechanism. Extensive experiments on 3SGen-Bench and several public benchmarks demonstrate significant improvements in cross-task fidelity and fine-grained controllability. The framework composes complex conditional specifications robustly, without task interference, establishing a new paradigm for multi-condition collaborative image generation.

πŸ“ Abstract
Recent image generation approaches often address subject, style, and structure-driven conditioning in isolation, leading to feature entanglement and limited task transferability. In this paper, we introduce 3SGen, a task-aware unified framework that performs all three conditioning modes within a single model. 3SGen employs an MLLM equipped with learnable semantic queries to align text-image semantics, complemented by a VAE branch that preserves fine-grained visual details. At its core, an Adaptive Task-specific Memory (ATM) module dynamically disentangles, stores, and retrieves condition-specific priors, such as identity for subjects, textures for styles, and spatial layouts for structures, via a lightweight gating mechanism along with several scalable memory items. This design mitigates inter-task interference and naturally scales to compositional inputs. In addition, we propose 3SGen-Bench, a unified image-driven generation benchmark with standardized metrics for evaluating cross-task fidelity and controllability. Extensive experiments on our proposed 3SGen-Bench and other public benchmarks demonstrate our superior performance across diverse image-driven generation tasks.
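The abstract describes the ATM module as a lightweight gate over scalable, task-specific memory items that stores and retrieves condition-specific priors. As an illustration only, the read path of such a module can be sketched as a gate-weighted sum over a per-task memory bank. All names, shapes, and the scoring rule below are assumptions for the sketch, not the paper's actual design:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class TaskMemory:
    """Hypothetical sketch of an ATM-style memory read.

    Each task (subject / style / structure) owns a bank of memory items
    (learnable in a real model, random here). A lightweight gate scores a
    query against every item, and the retrieved prior is the gate-weighted
    sum of the items.
    """

    def __init__(self, tasks, num_items=8, dim=16, seed=0):
        rng = np.random.default_rng(seed)
        # one (num_items, dim) bank per task; scaling num_items grows capacity
        self.banks = {t: rng.standard_normal((num_items, dim)) for t in tasks}

    def read(self, task, query):
        bank = self.banks[task]                                 # (num_items, dim)
        gate = softmax(bank @ query / np.sqrt(bank.shape[1]))   # (num_items,)
        return gate @ bank                                      # (dim,) retrieved prior

atm = TaskMemory(["subject", "style", "structure"])
prior = atm.read("style", np.ones(16))
print(prior.shape)  # (16,)
```

Keeping separate banks per condition type is one plausible way to realize the disentanglement the abstract claims: a subject query only ever reads identity-related items, so style and layout priors cannot leak into it through the gate.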
Problem

Research questions and friction points this paper is trying to address.

Unifies subject, style, and structure conditioning in one model
Mitigates feature entanglement and limited task transferability
Enables compositional inputs and reduces inter-task interference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for subject, style, and structure conditioning
Adaptive Task-specific Memory module disentangles and retrieves condition-specific priors
MLLM with semantic queries and VAE branch for text-image alignment and detail preservation
Xinyang Song (School of Artificial Intelligence, UCAS)
Libin Wang (AntGroup)
Weining Wang (CASIA)
Zhiwei Li (School of Artificial Intelligence, UCAS)
Jianxin Sun (AntGroup)
Dandan Zheng (AntGroup)
Jingdong Chen (AntGroup)
Qi Li (School of Artificial Intelligence, UCAS)
Zhenan Sun (Institute of Automation, Chinese Academy of Sciences)
Biometrics · Pattern Recognition · Computer Vision