🤖 AI Summary
This paper addresses the problem of full-body personalized image generation driven by a single in-the-wild human image. We propose the first vision foundation model tailored for full-body persona customization. Methodologically, we introduce fine-grained identity modeling and text-semantic alignment for the entire human body—not merely the face—and design an automatic vision-language-driven data curation pipeline, releasing the large-scale paired dataset Visual Persona-500K. Our architecture features a region-aware Transformer encoder-decoder that decomposes the input image into localized body regions and extracts dense identity embeddings from each region independently to condition diffusion-based generation. Experiments demonstrate state-of-the-art performance across diverse scenarios, poses, and clothing styles. Ablation studies validate the efficacy of each component, and downstream applications—including virtual try-on and character editing—further confirm the model’s generalizability and practical utility.
📝 Abstract
We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions. Unlike prior methods that focus solely on preserving facial identity, our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations. Training this model requires large-scale paired human data, consisting of multiple images per individual with consistent full-body identities, which is notoriously difficult to obtain. To address this, we propose a data curation pipeline leveraging vision-language models to evaluate full-body appearance consistency, resulting in Visual Persona-500K, a dataset of 580k paired human images across 100k unique identities. For precise appearance transfer, we introduce a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model, which decomposes the input image into distinct body regions, encodes these regions as local appearance features, and independently projects them into dense identity embeddings that condition the diffusion model to synthesize customized images. Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs. Extensive ablation studies validate design choices, and we demonstrate the versatility of Visual Persona across various downstream tasks.
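The region-aware conditioning described above can be illustrated with a minimal sketch: split the input image into body-region crops, encode each region independently, and concatenate the resulting dense embeddings as conditioning tokens. This is not the authors' implementation; the region names, crop layout, patch size, and the linear stand-in for the encoder are all hypothetical, chosen only to show the data flow.

```python
import numpy as np

# Hypothetical sketch of region-aware identity conditioning.
# Region names, shapes, and the linear "encoder" are illustrative,
# not the paper's actual architecture.
rng = np.random.default_rng(0)

REGIONS = ["face", "torso", "arms", "legs"]  # hypothetical decomposition
PATCH, DIM = 16, 64                          # hypothetical patch size / width

def encode_region(region_pixels, weight):
    """Flatten a region crop into patch tokens, then project to identity embeddings."""
    h, w, c = region_pixels.shape
    tokens = region_pixels.reshape(-1, PATCH * PATCH * c)  # naive patchify
    return tokens @ weight                                 # dense embeddings

image = rng.standard_normal((64, 64, 3))  # stand-in for an input human image
weights = {r: rng.standard_normal((PATCH * PATCH * 3, DIM)) for r in REGIONS}

# Pretend each region is a fixed 16x16 crop; a real system would use a
# human parser / segmenter to localize body regions.
crops = {r: image[i * 16:(i + 1) * 16, :16, :] for i, r in enumerate(REGIONS)}

# Encode each region independently, then concatenate into conditioning tokens
# that would be fed to the diffusion model (e.g., via cross-attention).
identity_tokens = np.concatenate(
    [encode_region(crops[r], weights[r]) for r in REGIONS], axis=0
)
print(identity_tokens.shape)  # one embedding row per region
```

The key design choice mirrored here is independence: each region gets its own projection, so appearance details of, say, the torso cannot be smeared into the face embedding.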