Viewpoint Consistency in 3D Generation via Attention and CLIP Guidance

📅 2024-12-03
🏛️ arXiv.org
🤖 AI Summary
This work addresses the Janus problem in text-to-3D generation: geometric inconsistency across viewpoints. It traces the problem to viewpoint generation bias in diffusion models, which leaves a gap between the viewpoint actually generated and the one expected when optimizing the 3D model. To close this gap without fine-tuning the pretrained diffusion model, the authors propose Attention and CLIP Guidance (ACG), a tuning-free, plug-and-play mechanism with three components: (1) adaptive control of cross-attention maps to strengthen the desired viewpoint; (2) CLIP-based view-text similarity to filter out erroneous viewpoints; and (3) a coarse-to-fine optimization strategy with staged prompts that progressively refines the 3D result. According to the abstract, experiments show that ACG significantly reduces the Janus problem without compromising generation speed.

📝 Abstract
Despite recent advances in text-to-3D generation techniques, current methods often suffer from geometric inconsistencies, commonly referred to as the Janus Problem. This paper identifies the root cause of the Janus Problem: viewpoint generation bias in diffusion models, which creates a significant gap between the actual generated viewpoint and the expected one required for optimizing the 3D model. To address this issue, we propose a tuning-free approach called the Attention and CLIP Guidance (ACG) mechanism. ACG enhances desired viewpoints by adaptively controlling cross-attention maps, employs CLIP-based view-text similarities to filter out erroneous viewpoints, and uses a coarse-to-fine optimization strategy with staged prompts to progressively refine 3D generation. Extensive experiments demonstrate that our method significantly reduces the Janus Problem without compromising generation speed, establishing ACG as an efficient, plug-and-play component for existing text-to-3D frameworks.
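The abstract does not spell out how the cross-attention maps are controlled, so the following is only a minimal sketch of the general idea, not the authors' rule. It assumes that each spatial location holds one attention logit per prompt token, and that the desired viewpoint is strengthened by adding a constant `boost` to the logit of the view token (for example "back") before renormalizing. The function names, the additive form, and the toy numbers are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def boost_view_token(attn_logits, view_token_idx, boost):
    """Raise the view token's logit at every spatial location, then renormalize.

    attn_logits: list of rows, one row of per-token logits per spatial location.
    view_token_idx: index of the viewpoint word in the prompt (assumed known).
    boost: additive increase applied to that token's logit (hypothetical knob).
    """
    out = []
    for row in attn_logits:
        shifted = list(row)
        shifted[view_token_idx] += boost
        out.append(softmax(shifted))
    return out


# Toy example: 2 spatial locations, 3 prompt tokens; token 2 is the view word.
logits = [[0.2, 1.0, 0.1], [0.5, 0.3, 0.0]]
before = [softmax(row) for row in logits]
after = boost_view_token(logits, view_token_idx=2, boost=1.5)
```

In this sketch each row of `after` is still a probability distribution, but the view token receives a larger share of attention than in `before`. An adaptive scheme would choose `boost` per step or per view rather than fixing it.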
Problem

Research questions and friction points this paper is trying to address.

Addressing geometric inconsistencies in text-to-3D generation
Reducing viewpoint generation bias in diffusion models
Improving 3D model optimization via viewpoint refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

ACG controls cross-attention maps adaptively
CLIP filters viewpoints via view-text similarity
Coarse-to-fine optimization with staged prompts
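The last two items can be illustrated with a small sketch under stated assumptions. The embeddings below are toy vectors standing in for CLIP image and text embeddings, and `predicted_view`, `keep_render`, `staged_prompt`, the `coarse_frac` switch point, and the prompt wording are hypothetical names and choices made for illustration, not details taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def predicted_view(image_emb, view_text_embs):
    """Return the view label whose text embedding is most similar to the image."""
    return max(view_text_embs, key=lambda v: cosine(image_emb, view_text_embs[v]))


def keep_render(image_emb, view_text_embs, expected_view):
    """Keep a generated view only if it matches the camera's expected view."""
    return predicted_view(image_emb, view_text_embs) == expected_view


def staged_prompt(base, view, step, total_steps, coarse_frac=0.5):
    """Use a short prompt early (coarse stage) and a more detailed one later."""
    if step < coarse_frac * total_steps:
        return f"{base}, {view} view"
    return f"{base}, {view} view, highly detailed"


# Toy stand-ins for CLIP text embeddings of "front view", "side view", "back view".
view_text_embs = {
    "front": [1.0, 0.0, 0.1],
    "side": [0.0, 1.0, 0.1],
    "back": [-1.0, 0.0, 0.1],
}
render_emb = [0.9, 0.1, 0.0]  # toy stand-in for a CLIP image embedding
```

With these toy vectors, the render is closest to the "front" text, so it would be rejected when the camera expects a back view. That is the kind of erroneous viewpoint the filtering step is meant to catch.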
Authors
Qing Zhang, Australian National University
Zehao Chen, PhD, Yale University (Porous Media, Fluid Dynamics, Polymer, Hydrogel)
Jinguang Tong, Australian National University (computer vision, 3D reconstruction)
Jing Zhang, Australian National University
Jie Hong, The University of Hong Kong
Xuesong Li, Australian National University, CSIRO, Australia