Pippo: High-Resolution Multi-View Humans from a Single Image

📅 2025-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the task of generating high-resolution, 3D-consistent turnaround videos of a person from a single casually captured photo. The authors propose Pippo, a multi-view diffusion transformer tailored for 360° surround-view synthesis. Methodologically: (i) an inference-time attention biasing technique lets the model generate more than 5× as many views as it was trained on; (ii) a two-stage studio training strategy first denoises many (up to 48) views at low resolution with target cameras encoded coarsely by a shallow MLP (mid-training), then denoises fewer views at high resolution with pixel-aligned controls such as Plücker rays and a spatial anchor (post-training); and (iii) the model requires no additional inputs such as a fitted parametric body model or camera parameters for the input image. Further contributions include an improved metric for evaluating the 3D consistency of multi-view generations and pre-training on 3B uncaptioned human images. Experiments show that Pippo outperforms prior work on single-image multi-view human generation, producing geometrically consistent, photorealistic 1K-resolution turnarounds.
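The attention biasing used at inference is, at its core, a rescaling of attention logits so the softmax does not flatten out when far more view tokens are present than at training time. A minimal NumPy sketch, assuming a log-ratio scaling rule as a stand-in for the paper's exact bias (`biased_attention`, `n_train`, and `scale_power` are illustrative names, not from the paper):

```python
import numpy as np

def biased_attention(q, k, v, n_train, scale_power=1.0):
    """Single-head attention with an entropy-style bias for long sequences.

    When the token count n exceeds the training length n_train, the logits
    are scaled up by (log n / log n_train) ** scale_power so the softmax
    stays as peaked as it was at training length. Hypothetical sketch of
    the general idea, not Pippo's exact rule.
    """
    n, d = q.shape
    gamma = max(1.0, (np.log(n) / np.log(n_train)) ** scale_power)
    logits = gamma * (q @ k.T) / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)            # softmax over keys
    return w @ v
```

For sequences at or below the training length, `gamma` clamps to 1 and this reduces to standard scaled dot-product attention, so the bias only activates when extrapolating to more views.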

📝 Abstract
We present Pippo, a generative model capable of producing 1K resolution dense turnaround videos of a person from a single casually clicked photo. Pippo is a multi-view diffusion transformer and does not require any additional inputs, e.g., a fitted parametric model or camera parameters of the input image. We pre-train Pippo on 3B human images without captions, and conduct multi-view mid-training and post-training on studio captured humans. During mid-training, to quickly absorb the studio dataset, we denoise several (up to 48) views at low-resolution, and encode target cameras coarsely using a shallow MLP. During post-training, we denoise fewer views at high-resolution and use pixel-aligned controls (e.g., spatial anchor and Plücker rays) to enable 3D consistent generations. At inference, we propose an attention biasing technique that allows Pippo to simultaneously generate greater than 5 times as many views as seen during training. Finally, we also introduce an improved metric to evaluate 3D consistency of multi-view generations, and show that Pippo outperforms existing works on multi-view human generation from a single image.
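The Plücker rays used as pixel-aligned camera controls during post-training assign each pixel a 6D embedding of its viewing ray: the ray's unit direction plus its moment (origin × direction). A minimal sketch of how such embeddings can be computed for a pinhole camera (the function name and coordinate conventions are assumptions, not taken from the paper):

```python
import numpy as np

def plucker_rays(K, R, t, H, W):
    """Per-pixel Plücker ray embeddings for a pinhole camera.

    K: 3x3 intrinsics; R, t: world-to-camera extrinsics (x_cam = R x_world + t).
    Returns an (H, W, 6) array: 3 direction channels + 3 moment channels.
    """
    # Camera center in world coordinates.
    o = -R.T @ t
    # Pixel grid in homogeneous coordinates, sampled at pixel centers.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3)
    # Back-project pixels to world-space ray directions and normalize.
    d = pix @ np.linalg.inv(K).T @ R                   # rows of R^T K^{-1} p
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    # Plücker moment: m = o x d (shared origin, per-pixel direction).
    m = np.cross(np.broadcast_to(o, d.shape), d)
    return np.concatenate([d, m], axis=-1)             # (H, W, 6)
```

Because the moment depends on the camera center and the direction on the full rotation and intrinsics, every pixel of every target view gets a distinct, geometry-aware conditioning signal, which is what makes the control pixel-aligned.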
Problem

Research questions and friction points this paper is trying to address.

Generating high-resolution multi-view videos of humans
Using only a single casual photo as input
Ensuring 3D consistency across generated views
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view diffusion transformer for human generation
Pre-training on 3B uncaptioned human images
Attention biasing to generate more views at inference