🤖 AI Summary
Existing personalized generation methods struggle to simultaneously ensure identity consistency and multi-view controllability. To address this, we propose PersonalView, a novel approach that achieves identity-consistent multi-view synthesis using only 100 single-view training samples. Built upon a pre-trained diffusion Transformer, PersonalView introduces a conditioning architecture and a Semantic Correspondence Alignment Loss to fully harness the model's in-context learning capability, enabling fine-grained viewpoint control and cross-view identity preservation without fine-tuning the backbone. Experiments demonstrate that PersonalView consistently outperforms baselines trained on large-scale multi-view datasets across multi-view consistency, text-image alignment, identity similarity, and visual quality, despite requiring minimal training data, highlighting its efficiency, scalability, and practicality for real-world personalized generation.
📝 Abstract
Recent advances in personalized generative models demonstrate impressive results in creating identity-consistent images of the same person under diverse settings. Yet we note that most methods can neither control the viewpoint of the generated image nor generate consistent multiple views of the person. To address this problem, we propose PersonalView, a lightweight adaptation method capable of endowing an existing model with multi-view generation capability using as few as 100 training samples. PersonalView consists of two key components: first, we design a conditioning architecture that takes advantage of the in-context learning ability of the pre-trained diffusion transformer; second, we preserve the original generative ability of the pre-trained model with a new Semantic Correspondence Alignment Loss. We evaluate the multi-view consistency, text alignment, identity similarity, and visual quality of PersonalView, and compare it to recent baselines with the potential capability of multi-view customization. With only 100 training samples, PersonalView significantly outperforms baselines trained on large corpora of multi-view data.