USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning

📅 2025-08-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing methods treat style-driven and subject-driven image generation as disjoint tasks, failing to jointly optimize style fidelity and subject consistency. To address this, we propose USO, the first unified framework that synergistically models content and style via disentangled learning. Our approach constructs a large-scale triplet dataset and introduces a three-stage training paradigm: style alignment, content-style disentanglement, and style reward learning (SRL), all built upon diffusion-based optimization. Furthermore, we release USO-Bench, the first comprehensive benchmark for joint evaluation of style fidelity and subject consistency. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models on both metrics, significantly outperforming prior methods. To foster reproducibility and community advancement, we fully open-source the code, pre-trained models, and dataset.

📝 Abstract
Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives, style-alignment training and content-style disentanglement training. Third, we incorporate a style reward-learning paradigm denoted as SRL to further enhance the model's performance. Finally, we release USO-Bench, the first benchmark that jointly evaluates style similarity and subject fidelity across multiple metrics. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models along both dimensions of subject consistency and style similarity. Code and model: https://github.com/bytedance/USO
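The abstract describes training data as triplets of a content image, a style image, and the corresponding stylized content image. As a minimal sketch of what one such record might look like, the following uses illustrative field names and a simple distinctness check; this is an assumption for clarity, not the authors' actual data schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StyleTriplet:
    """One training example; field names are illustrative, not USO's schema."""
    content_image: str   # path to the subject/content image
    style_image: str     # path to the style reference image
    stylized_image: str  # path to the content image rendered in the reference style


def make_triplet(content: str, style: str, stylized: str) -> StyleTriplet:
    # Sanity check: the three roles must be filled by distinct images.
    if len({content, style, stylized}) != 3:
        raise ValueError("content, style, and stylized images must be distinct")
    return StyleTriplet(content, style, stylized)
```

A dataset in this shape makes the disentanglement objective concrete: the model sees which parts of `stylized_image` come from `content_image` (subject) and which from `style_image` (style).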
Problem

Research questions and friction points this paper is trying to address.

Unifying style-driven and subject-driven image generation tasks
Disentangling content and style features for better composition
Enhancing both style similarity and subject consistency simultaneously
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Style-Subject Optimized customization model
Disentangled learning scheme with style-alignment and content-style disentanglement objectives
Style reward-learning paradigm to enhance performance
Shaojin Wu
UXO Team, Intelligent Creation Lab, ByteDance
Mengqi Huang
University of Science and Technology of China
Yufeng Cheng
UXO Team, Intelligent Creation Lab, ByteDance
Wenxu Wu
UXO Team, Intelligent Creation Lab, ByteDance
Jiahe Tian
UXO Team, Intelligent Creation Lab, ByteDance
Yiming Luo
PhD student, The University of Hong Kong
Fei Ding
Unknown affiliation
Qian He
ByteDance