🤖 AI Summary
To address inaccurate alignment between input conditions (e.g., edges, depth) and generated 3D geometry in controllable 3D generation, this paper proposes a second-order cycle-consistency regularization framework. The method introduces a dual-constraint mechanism: (i) view consistency ensures coherent 3D structure across the two rounds of generation, and (ii) condition consistency enforces precise recovery of fine-grained geometric details from the input conditions. The approach supports end-to-end, joint text-and-condition-image-driven generation via a feed-forward 3D backbone, combined with multi-view rendering, control-signal re-extraction, and semantic-similarity-based consistency measures. Extensive experiments on mainstream benchmarks demonstrate significant improvements in controllability (+14.17% PSNR under edge guidance, +6.26% under sketch guidance) and superior fine-grained structural fidelity compared to state-of-the-art approaches.
📝 Abstract
Despite the remarkable progress of 3D generation, achieving controllability, i.e., ensuring consistency between generated 3D content and input conditions such as edge and depth maps, remains a significant challenge. Existing methods often struggle to maintain accurate alignment, leading to noticeable discrepancies. To address this issue, we propose a new framework that enhances controllable 3D generation by explicitly encouraging cyclic consistency between the second-order 3D content, generated from signals extracted from the first-order generation, and the original input controls. Specifically, we employ an efficient feed-forward backbone that generates a 3D object from an input condition and a text prompt. Given an initial viewpoint and a control signal, a novel view is rendered from the generated 3D content, and the condition extracted from this view is used to regenerate the 3D content. The regenerated output is then rendered back to the initial viewpoint, followed by another round of control-signal extraction, forming a cyclic process with two consistency constraints. *View consistency* ensures coherence between the two generated 3D objects, measured by semantic similarity to accommodate generative diversity. *Condition consistency* aligns the final extracted signal with the original input control, preserving structural and geometric details throughout the process. Extensive experiments on popular benchmarks demonstrate that our framework significantly improves controllability, especially for fine-grained details, outperforming existing methods across various conditions (e.g., +14.17% PSNR for edge, +6.26% PSNR for sketch).
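The two consistency constraints in the cyclic process can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the 3D generation, rendering, and condition-extraction stages are replaced by placeholder arrays, condition consistency is scored here with PSNR (the metric the paper reports), and view consistency with cosine similarity of stand-in feature embeddings; all function names are assumptions.

```python
import numpy as np

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio (dB) between two images in [0, 1]."""
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def cosine_similarity(u, v):
    """Semantic similarity between two feature embeddings."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cycle_consistency_scores(cond_in, cond_cycled, feat_first, feat_second):
    """Score the two constraints of the cyclic process (illustrative):
    - condition consistency: the control signal re-extracted after the
      full cycle should match the original input condition;
    - view consistency: the first- and second-order generations should
      be semantically coherent (compared here via embeddings)."""
    cond_score = psnr(cond_in, cond_cycled)
    view_score = cosine_similarity(feat_first, feat_second)
    return cond_score, view_score

# Placeholder data standing in for an edge map and two renderings' features.
rng = np.random.default_rng(0)
cond = rng.random((64, 64))                                   # input condition
cond_cycled = np.clip(cond + 0.01 * rng.standard_normal((64, 64)), 0, 1)
feat_first = rng.standard_normal(128)                         # 1st-order embedding
feat_second = feat_first + 0.05 * rng.standard_normal(128)    # 2nd-order embedding

cond_score, view_score = cycle_consistency_scores(
    cond, cond_cycled, feat_first, feat_second
)
```

A faithful cycle yields a high PSNR for the re-extracted condition and a view-consistency score near 1; during training these scores would be turned into losses to be minimized.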