🤖 AI Summary
This work addresses the challenges of view consistency and conditional controllability in multi-view image generation. To this end, we propose MV-AR, a multi-view autoregressive generative framework. Methodologically, MV-AR introduces a unified X-to-multi-view modeling paradigm and incorporates condition injection modules to fuse heterogeneous modalities, including text, camera pose, reference images, and shape priors. A progressive training strategy enhances geometric and textural consistency across views, while a novel "Shuffle View" data augmentation technique mitigates the scarcity of high-quality multi-view training data. Experiments demonstrate that MV-AR achieves strong cross-view geometric and appearance consistency, supports diverse conditional inputs, and attains generation quality competitive with state-of-the-art diffusion-based methods, while exhibiting superior generalization, adaptability, and training stability on multi-view synthesis tasks.
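To make the autoregressive design concrete, here is a minimal sketch (a toy illustration, not the released implementation) of how tokens from several views can be flattened into one sequence and modeled with causal attention, so that each later view attends to every preceding view. All names and sizes (MVARToy, n_views, tokens_per_view) are assumptions.

```python
# Toy sketch of multi-view next-token prediction (assumed names/sizes).
import torch
import torch.nn as nn

class MVARToy(nn.Module):
    def __init__(self, vocab=8192, dim=256, n_heads=8, n_layers=4,
                 n_views=4, tokens_per_view=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        # One learned position per token slot across all views.
        self.pos = nn.Parameter(torch.zeros(n_views * tokens_per_view, dim))
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        # tokens: (B, T) image tokens of all views generated so far,
        # flattened in view order (view 0 tokens, then view 1, ...).
        T = tokens.size(1)
        x = self.embed(tokens) + self.pos[:T]
        # Causal mask: position t attends only to positions <= t, so the
        # tokens of view k can reference every token of views 0..k-1.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        x = self.blocks(x, mask=causal)
        return self.head(x)  # logits predicting the next token at each step
```

This is only the generic decoder-only recipe; the paper's contributions lie in how conditions are injected into such a backbone and how it is trained, as outlined in the abstract below.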
📝 Abstract
Generating multi-view images from human instructions is crucial for 3D content creation. The primary challenges are maintaining consistency across views and effectively synthesizing shapes and textures under diverse conditions. In this paper, we propose the Multi-View Auto-Regressive (MV-AR) method, which leverages an auto-regressive model to progressively generate consistent multi-view images from arbitrary prompts. First, the next-token-prediction capability of the AR model naturally suits progressive multi-view synthesis: when generating a widely separated view, MV-AR can draw reference information from all of its preceding views. Second, we propose a unified model that accommodates various prompts through architecture design and training strategies. To handle multiple conditions, we introduce condition injection modules for text, camera pose, image, and shape. To manage multi-modal conditions simultaneously, we employ a progressive training strategy: the text-to-multi-view (t2mv) model is first trained as a baseline and then extended into a comprehensive X-to-multi-view (X2mv) model by randomly dropping and combining conditions. Finally, to alleviate overfitting caused by limited high-quality data, we propose the "Shuffle View" data augmentation technique, which expands the effective training data by several orders of magnitude. Experiments demonstrate the performance and versatility of MV-AR, which generates consistent multi-view images across a range of conditions and performs on par with leading diffusion-based multi-view image generation models. Code and models will be released at https://github.com/MILab-PKU/MVAR.
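As a rough illustration of the progressive t2mv-to-X2mv training described above, the condition set fed to the model can be resampled at every step: text is always kept, and each extra modality is independently dropped or retained. This is a hedged sketch; the phase boundary and drop probability below are assumptions, not the paper's values.

```python
# Sketch of progressive condition dropout for X2mv training (assumed values).
import random

def sample_condition_set(step, warmup_steps=10_000, p_drop=0.5):
    # Phase 1: pure text-to-multi-view (t2mv) baseline.
    if step < warmup_steps:
        return {"text"}
    # Phase 2: text is always kept; each extra modality is independently
    # dropped or kept, so the model sees many condition combinations.
    conds = {"text"}
    for extra in ("pose", "image", "shape"):
        if random.random() > p_drop:
            conds.add(extra)
    return conds
```

Random dropping of this kind is one plausible way a single model can serve t2mv, image-conditioned, and other X2mv settings at inference, depending on which conditions the user supplies.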
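The "Shuffle View" augmentation can likewise be sketched: permute the generation order of the views together with their camera poses, so one captured set yields many distinct training sequences. The data layout below (a list of (image_tokens, camera_pose) pairs per object) is an assumption.

```python
# Sketch of "Shuffle View": permute views (with their poses) per sample.
import random

def shuffle_view(views):
    """views: list of (image_tokens, camera_pose) pairs for one object."""
    order = list(range(len(views)))
    random.shuffle(order)  # a new autoregressive generation order
    return [views[i] for i in order]
```

With n views per object there are up to n! orderings (e.g., 8! = 40,320), which is consistent with the abstract's claim of expanding the training data by several orders of magnitude.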