3SGen: Unified Subject, Style, and Structure-Driven Image Generation with Adaptive Task-specific Memory

📅 2025-12-22
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing image generation methods suffer from feature entanglement when jointly conditioned on subject, style, and structure, resulting in poor cross-task transferability and limited fine-grained controllability. To address this, the paper proposes the first unified triple-conditioned (subject/style/structure) generation framework. The method introduces an Adaptive Task-specific Memory (ATM) module that dynamically disentangles and retrieves identity-, texture-, and layout-related priors; establishes 3SGen-Bench, a standardized benchmark for evaluating triple-conditioned generation; and integrates multimodal large language model (MLLM)-driven semantic understanding, learnable queries, VAE-based latent modeling, and a lightweight gated ATM mechanism. Extensive experiments on 3SGen-Bench and several public benchmarks demonstrate significant improvements in cross-task fidelity and fine-grained controllability. The framework composes complex conditional specifications robustly, without task interference, establishing a new paradigm for multi-condition collaborative image generation.

πŸ“ Abstract
Recent image generation approaches often address subject, style, and structure-driven conditioning in isolation, leading to feature entanglement and limited task transferability. In this paper, we introduce 3SGen, a task-aware unified framework that performs all three conditioning modes within a single model. 3SGen employs an MLLM equipped with learnable semantic queries to align text-image semantics, complemented by a VAE branch that preserves fine-grained visual details. At its core, an Adaptive Task-specific Memory (ATM) module dynamically disentangles, stores, and retrieves condition-specific priors, such as identity for subjects, textures for styles, and spatial layouts for structures, via a lightweight gating mechanism along with several scalable memory items. This design mitigates inter-task interference and naturally scales to compositional inputs. In addition, we propose 3SGen-Bench, a unified image-driven generation benchmark with standardized metrics for evaluating cross-task fidelity and controllability. Extensive experiments on our proposed 3SGen-Bench and other public benchmarks demonstrate our superior performance across diverse image-driven generation tasks.
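The abstract describes the ATM module as a lightweight gate over scalable, task-specific memory items that stores and retrieves condition-specific priors. As an illustration only, the read path of such a module can be sketched as a gate-weighted sum over a per-task memory bank. All names, shapes, and the scoring rule below are assumptions for the sketch, not the paper's actual design:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class TaskMemory:
    """Hypothetical sketch of an ATM-style memory read.

    Each task (subject / style / structure) owns a bank of memory items
    (learnable in a real model, random here). A lightweight gate scores a
    query against every item, and the retrieved prior is the gate-weighted
    sum of the items.
    """

    def __init__(self, tasks, num_items=8, dim=16, seed=0):
        rng = np.random.default_rng(seed)
        # one (num_items, dim) bank per task; scaling num_items grows capacity
        self.banks = {t: rng.standard_normal((num_items, dim)) for t in tasks}

    def read(self, task, query):
        bank = self.banks[task]                                 # (num_items, dim)
        gate = softmax(bank @ query / np.sqrt(bank.shape[1]))   # (num_items,)
        return gate @ bank                                      # (dim,) retrieved prior

atm = TaskMemory(["subject", "style", "structure"])
prior = atm.read("style", np.ones(16))
print(prior.shape)  # (16,)
```

Keeping separate banks per condition type is one plausible way to realize the disentanglement the abstract claims: a subject query only ever reads identity-related items, so style and layout priors cannot leak into it through the gate.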
Problem

Research questions and friction points this paper is trying to address.

Unifies subject, style, and structure conditioning in one model
Mitigates feature entanglement and limited task transferability
Enables compositional inputs and reduces inter-task interference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for subject, style, and structure conditioning
Adaptive Task-specific Memory module disentangles and retrieves condition-specific priors
MLLM with semantic queries and VAE branch for text-image alignment and detail preservation
Xinyang Song (School of Artificial Intelligence, UCAS)
Libin Wang (AntGroup)
Weining Wang (CASIA)
Zhiwei Li (School of Artificial Intelligence, UCAS)
Jianxin Sun (AntGroup)
Dandan Zheng (AntGroup)
Jingdong Chen (AntGroup)
Qi Li (School of Artificial Intelligence, UCAS)
Zhenan Sun (Institute of Automation, Chinese Academy of Sciences)
Biometrics · Pattern Recognition · Computer Vision