🤖 AI Summary
This work addresses the challenge of generating geometrically consistent, identity-preserving 360° orbit videos and high-quality 3D human reconstructions from a single portrait image. To this end, we propose the first end-to-end framework based on video diffusion models that explicitly enforces inter-view geometric consistency and identity fidelity by generating temporally coherent multi-view image sequences, and that directly outputs textured 3D meshes. Our method outperforms existing state-of-the-art approaches in multi-view photorealism, 3D model completeness, and fine-grained detail preservation. Notably, this is the first successful application of video diffusion models to 360° human view synthesis, demonstrating their effectiveness at modeling the complex spatio-temporal dependencies required for coherent 3D-aware generation.
📝 Abstract
We present a method for generating a full 360° orbit video around a person from a single input image. Existing methods typically adapt image-based diffusion models for multi-view synthesis, but yield results that are inconsistent across views and with the original identity. In contrast, recent video diffusion models have demonstrated their ability to generate photorealistic results that align well with the given prompts. Inspired by these results, we propose HumanOrbit, a video diffusion model for multi-view human image generation. Our approach enables the model to synthesize continuous camera rotations around the subject, producing geometrically consistent novel views while preserving the appearance and identity of the person. From the generated multi-view frames, we further propose a reconstruction pipeline that recovers a textured mesh of the subject. Experimental results validate the effectiveness of HumanOrbit for multi-view image generation and show that the reconstructed 3D models exhibit superior completeness and fidelity compared to those from state-of-the-art baselines.
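The abstract describes a two-stage pipeline: a video diffusion model turns one portrait into a sequence of orbit views, and a reconstruction stage turns those views into a textured mesh. The Python sketch below illustrates only that data flow; it is not HumanOrbit's implementation. The names `orbit_camera_poses`, `sample_orbit_video`, and `reconstruct_mesh` are hypothetical stand-ins (the paper's abstract specifies neither the sampler interface nor the reconstruction method), and the model calls are replaced with placeholder outputs so the sketch runs end to end.

```python
# Illustrative sketch only: placeholders for the two stages described in
# the abstract (orbit-video generation, then textured-mesh reconstruction).
# None of these functions reflect HumanOrbit's actual implementation.
import numpy as np

def orbit_camera_poses(n_views: int, radius: float = 2.0, height: float = 0.0):
    """Camera positions on a full 360-degree circle around the subject."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_views, endpoint=False)
    # Each pose: a camera position assumed to look at the origin (subject center).
    return [np.array([radius * np.cos(a), height, radius * np.sin(a)]) for a in angles]

def sample_orbit_video(portrait: np.ndarray, poses) -> list:
    """Placeholder for the video diffusion sampler: conditioned on the input
    portrait and a camera trajectory, it would denoise a temporally coherent
    frame sequence. Here we just return dummy frames of the right shape."""
    return [np.zeros_like(portrait) for _ in poses]

def reconstruct_mesh(frames: list, poses) -> dict:
    """Placeholder for the reconstruction stage that recovers a textured mesh
    from the generated views (the method is unspecified in the abstract)."""
    return {"vertices": np.zeros((0, 3)), "faces": np.zeros((0, 3), dtype=int)}

portrait = np.zeros((512, 512, 3), dtype=np.float32)  # single input image
poses = orbit_camera_poses(n_views=60)                # one view per 6 degrees
frames = sample_orbit_video(portrait, poses)          # stage 1: orbit video
mesh = reconstruct_mesh(frames, poses)                # stage 2: textured mesh
```

The sketch makes one structural point from the abstract concrete: the reconstruction stage consumes the generated frames together with their (known, scripted) camera trajectory, which is why inter-view consistency in stage 1 directly determines mesh completeness and fidelity in stage 2.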