Exploring Disentangled and Controllable Human Image Synthesis: From End-to-End to Stage-by-Stage

📅 2025-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of disentangling and jointly controlling four key factors—viewpoint, pose, clothing, and identity—in human image synthesis. We propose a stage-wise generation framework: first synthesizing a clothed A-pose base body, then generating the back-view image, and finally jointly modulating pose and viewpoint. This design overcomes the disentanglement failure of end-to-end models under cross-domain data (MVHumanNet for multi-view synthesis and VTON for virtual try-on), introducing stage-wise conditional modeling and cross-domain feature alignment. Experiments demonstrate that our method significantly improves disentanglement and visual fidelity in in-the-wild scenarios, outperforming end-to-end baselines across quantitative metrics. It enables high-precision, fine-grained editing and exhibits strong generalization to unseen poses, viewpoints, and clothing configurations.

📝 Abstract
Achieving fine-grained controllability in human image synthesis is a long-standing challenge in computer vision. Existing methods primarily focus on either facial synthesis or near-frontal body generation, with limited ability to simultaneously control key factors such as viewpoint, pose, clothing, and identity in a disentangled manner. In this paper, we introduce a new disentangled and controllable human synthesis task, which explicitly separates and manipulates these four factors within a unified framework. We first develop an end-to-end generative model trained on MVHumanNet for factor disentanglement. However, the domain gap between MVHumanNet and in-the-wild data produces unsatisfactory results, motivating the exploration of a virtual try-on (VTON) dataset as a potential solution. Through experiments, we observe that simply incorporating the VTON dataset as additional data to train the end-to-end model degrades performance, primarily due to the inconsistency in data forms between the two datasets, which disrupts the disentanglement process. To better leverage both datasets, we propose a stage-by-stage framework that decomposes human image generation into three sequential steps: clothed A-pose generation, back-view synthesis, and pose and view control. This structured pipeline enables better dataset utilization at different stages, significantly improving controllability and generalization, especially for in-the-wild scenarios. Extensive experiments demonstrate that our stage-by-stage approach outperforms end-to-end models in both visual fidelity and disentanglement quality, offering a scalable solution for real-world tasks. Additional demos are available on the project page: https://taited.github.io/discohuman-project/.
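The three-stage decomposition described in the abstract can be sketched as a simple pipeline of conditional generators. This is a minimal illustrative sketch only: the function names, the dictionary-based image representation, and the control signals are hypothetical placeholders, not the authors' actual models or API.

```python
# Hedged sketch of the stage-by-stage pipeline (clothed A-pose generation ->
# back-view synthesis -> joint pose/view control). All functions below are
# placeholder stubs standing in for the paper's generative models.

def generate_clothed_apose(identity, clothing):
    """Stage 1: synthesize a clothed A-pose base body (placeholder stub)."""
    return {"identity": identity, "clothing": clothing,
            "pose": "A-pose", "view": "front"}

def synthesize_back_view(front_image):
    """Stage 2: generate the back-view image from the front A-pose result."""
    back = dict(front_image)
    back["view"] = "back"
    return back

def control_pose_and_view(front_image, back_image, target_pose, target_view):
    """Stage 3: jointly modulate pose and viewpoint, conditioned on both views."""
    result = dict(front_image)
    result["pose"] = target_pose
    result["view"] = target_view
    return result

def stagewise_synthesis(identity, clothing, target_pose, target_view):
    """Chain the three stages; identity/clothing are fixed in stage 1,
    so later stages can vary pose and view without entangling them."""
    front = generate_clothed_apose(identity, clothing)
    back = synthesize_back_view(front)
    return control_pose_and_view(front, back, target_pose, target_view)
```

The point of the structure, as the abstract argues, is that each dataset supplies supervision for a different stage (VTON-style data for the clothed base body, MVHumanNet's multi-view data for view and pose control), so disentanglement is enforced by the pipeline rather than learned end-to-end.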
Problem

Research questions and friction points this paper is trying to address.

Achieving fine-grained controllability in human image synthesis
Disentangling and controlling viewpoint, pose, clothing, and identity
Improving generalization for in-the-wild human image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end generative model for factor disentanglement
Stage-by-stage framework for sequential generation
Utilizes MVHumanNet and VTON datasets effectively