🤖 AI Summary
Diffusion models excel at modeling complex hairstyles but struggle to generate consistent, high-fidelity hair across multiple views, a key bottleneck for digital-human applications. To address this, we propose Stable-Hair v2, the first multi-view diffusion-based framework for high-fidelity hair transfer. Our method introduces polar-azimuth embeddings to explicitly encode the geometric relationships among camera viewpoints and adds temporal attention layers for pose-controllable, smooth cross-view transitions. We further develop an end-to-end pipeline that generates triplet training data by combining a diffusion-based Bald Converter, a data-augmentation inpainting model, and a face-finetuned multi-view diffusion model. Experiments demonstrate that our approach significantly outperforms existing methods in multi-view consistency, fine-grained detail fidelity, and visual coherence, establishing a new state of the art for this task. The code is publicly available.
📝 Abstract
While diffusion-based methods have shown impressive capabilities in capturing diverse and complex hairstyles, their ability to generate consistent and high-quality multi-view outputs -- crucial for real-world applications such as digital humans and virtual avatars -- remains underexplored. In this paper, we propose Stable-Hair v2, a novel diffusion-based multi-view hair transfer framework. To the best of our knowledge, this is the first work to leverage multi-view diffusion models for robust, high-fidelity, and view-consistent hair transfer across multiple perspectives. We introduce a comprehensive multi-view training data generation pipeline comprising a diffusion-based Bald Converter, a data-augmentation inpainting model, and a face-finetuned multi-view diffusion model to generate high-quality triplet data, including bald images, reference hairstyles, and view-aligned source-bald pairs. Our multi-view hair transfer model integrates polar-azimuth embeddings for pose conditioning and temporal attention layers to ensure smooth transitions between views. To optimize this model, we design a multi-stage training strategy consisting of pose-controllable latent IdentityNet training, hair extractor training, and temporal attention training. Extensive experiments demonstrate that our method accurately transfers detailed and realistic hairstyles to source subjects while achieving seamless and consistent results across views, significantly outperforming existing methods and establishing a new benchmark in multi-view hair transfer. Code is publicly available at https://github.com/sunkymepro/StableHairV2.
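The abstract does not spell out the exact form of the polar-azimuth pose embedding. A minimal sketch, assuming a standard sinusoidal encoding of the two camera angles at geometrically spaced frequencies (the function name and dimensionality are illustrative, not taken from the paper):

```python
import math

def pose_embedding(polar: float, azimuth: float, dim: int = 64) -> list[float]:
    """Sinusoidal embedding of a camera pose given by polar and azimuth
    angles (in radians). Each angle is encoded by (sin, cos) pairs at
    frequencies 2**k, in the spirit of positional encodings; this is an
    illustrative assumption, not the paper's exact formulation."""
    assert dim % 4 == 0, "dim must split evenly into sin/cos pairs per angle"
    n_freqs = dim // 4  # each angle contributes n_freqs (sin, cos) pairs
    emb = []
    for angle in (polar, azimuth):
        for k in range(n_freqs):
            freq = 2.0 ** k
            emb.append(math.sin(freq * angle))
            emb.append(math.cos(freq * angle))
    return emb

# One embedding vector per target view, used to condition generation on pose.
view_emb = pose_embedding(polar=math.pi / 3, azimuth=math.pi / 2)
```

Such a per-view vector would be injected into the diffusion backbone alongside the hair and identity conditions, giving the model an explicit, continuous signal for where each view sits on the viewing sphere.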