🤖 AI Summary
To address cross-view geometric inconsistency in single-image-and-text-driven multi-view image generation, this paper proposes the first end-to-end discrete diffusion framework for the task. It formulates multi-view synthesis as a joint text-conditioned visual-token sequence prediction problem, using MAGVIT-v2 for efficient visual tokenization and combining multimodal masked modeling with autoregressive Transformer decoding to enforce geometric consistency. Crucially, the method requires no 3D priors or explicit geometric constraints: structural coherence across views emerges solely from stochastic masking and self-attention. The architecture is lightweight and fully differentiable. Extensive experiments on GSO and 3D-FUTURE demonstrate state-of-the-art performance: the approach achieves superior PSNR, SSIM, and LPIPS scores compared to existing continuous-diffusion and 3D-aware methods, ranking first on multiple metrics.
📝 Abstract
Multi-view image generation from a single image and text description remains challenging due to the difficulty of maintaining geometric consistency across different viewpoints. Existing approaches typically rely on 3D-aware architectures or specialized diffusion models that require extensive multi-view training data and complex geometric priors. In this work, we introduce ViewMask-1-to-3, a pioneering approach that applies discrete diffusion models to multi-view image generation. Unlike continuous diffusion methods that operate in latent spaces, ViewMask-1-to-3 formulates multi-view synthesis as a discrete sequence modeling problem, where each viewpoint is represented as visual tokens obtained through MAGVIT-v2 tokenization. By unifying language and vision through masked token prediction, our approach progressively generates multiple viewpoints via text-conditioned iterative token unmasking. ViewMask-1-to-3 achieves cross-view consistency through simple random masking combined with self-attention, eliminating the need for complex 3D geometric constraints or specialized attention architectures. Our approach demonstrates that discrete diffusion is a viable and simple alternative to existing multi-view generation methods, ranking first on average across the GSO and 3D-FUTURE datasets in terms of PSNR, SSIM, and LPIPS, while maintaining architectural simplicity.
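To make the "iterative token unmasking" idea concrete, here is a minimal, hypothetical sketch of confidence-ordered unmasking over a discrete visual-token sequence, in the style the abstract describes. Everything here is illustrative, not taken from the paper: `logits_fn` stands in for the text- and image-conditioned Transformer, `MASK_ID` is a placeholder mask-token id, and the linear unmasking schedule is an assumption.

```python
import numpy as np

MASK_ID = -1  # hypothetical mask-token id (stands in for the tokenizer's [MASK])

def iterative_unmask(logits_fn, seq_len, steps=4):
    """Progressively fill a fully masked token sequence.

    At each step the model predicts all positions, and the most
    confident still-masked positions are committed, so the sequence
    is fully unmasked after `steps` iterations.
    """
    tokens = np.full(seq_len, MASK_ID, dtype=np.int64)
    for step in range(steps):
        logits = logits_fn(tokens)                       # (seq_len, vocab)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred, conf = probs.argmax(-1), probs.max(-1)
        masked = np.flatnonzero(tokens == MASK_ID)
        k = int(np.ceil(len(masked) / (steps - step)))   # linear schedule (assumed)
        chosen = masked[np.argsort(-conf[masked])[:k]]   # highest-confidence first
        tokens[chosen] = pred[chosen]
    return tokens

def toy_logits(tokens):
    """Stand-in for the conditioned Transformer: favors token i % 8 at position i."""
    logits = np.zeros((len(tokens), 8))
    logits[np.arange(len(tokens)), np.arange(len(tokens)) % 8] = 5.0
    return logits
```

In the full method, the same loop would run per viewpoint over MAGVIT-v2 token grids, with self-attention across views supplying the cross-view consistency; this sketch only shows the unmasking schedule itself.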