🤖 AI Summary
This work addresses the challenge of generating geometrically consistent multi-view scenes from a single freehand sketch, an input that carries little geometric information and is prone to spatial distortions, and for which no paired training data exists. The authors propose an approach built on a video Transformer that incorporates Parallel Camera-Aware Attention Adapters (CA3) and a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions, so that all views are synthesized in a single denoising process without reference images, iterative refinement, or per-scene optimization. A further contribution is a curated dataset of roughly 9k sketch-to-multi-view samples, constructed via an automated generation and filtering pipeline. Experiments show significant improvements over state-of-the-art two-stage baselines: an improvement of over 60% in FID, a 23% improvement in geometric consistency (Corr-Acc), and up to 3.7× faster inference.
📝 Abstract
We tackle a new problem: generating geometrically consistent multi-view scenes from a single freehand sketch. Freehand sketches are the most geometrically impoverished input one could offer a multi-view generator. They convey scene intent through abstract strokes while introducing spatial distortions that actively conflict with any consistent 3D interpretation. No prior method attempts this; existing multi-view approaches require photographs or text, while sketch-to-3D methods need multiple views or costly per-scene optimization.
We address three compounding challenges (the absence of training data, the need for geometric reasoning from distorted 2D input, and cross-view consistency) through three mutually reinforcing contributions: (i) a curated dataset of $\sim$9k sketch-to-multiview samples, constructed via an automated generation and filtering pipeline; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; and (iii) a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions.
Our framework synthesizes all views in a single denoising process without requiring reference images, iterative refinement, or per-scene optimization. Our approach significantly outperforms state-of-the-art two-stage baselines, improving realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23%, while providing up to a 3.7$\times$ inference speedup.
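The abstract does not give the exact form of the Sparse Correspondence Supervision Loss, so the following is only a minimal sketch of the general idea of supervising a few matched points across views. It assumes that Structure-from-Motion yields pixel-level matches between two views and that the loss penalizes feature disagreement at those matched locations only. The cosine-distance formulation, the per-pixel feature maps, and all names (`sparse_correspondence_loss`, `feat_a`, `feat_b`, `matches`) are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple

Feature = Sequence[float]
FeatureMap = List[List[Feature]]                  # indexed as fmap[row][col]
Match = Tuple[Tuple[int, int], Tuple[int, int]]   # ((row_a, col_a), (row_b, col_b))


def cosine_similarity(u: Feature, v: Feature, eps: float = 1e-8) -> float:
    """Cosine similarity of two feature vectors; eps guards against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def sparse_correspondence_loss(
    feat_a: FeatureMap, feat_b: FeatureMap, matches: List[Match]
) -> float:
    """Mean (1 - cosine similarity) over matched locations only.

    Hypothetical formulation: pixels without a match contribute nothing,
    which is what makes the supervision sparse.
    """
    if not matches:
        return 0.0
    total = 0.0
    for (row_a, col_a), (row_b, col_b) in matches:
        total += 1.0 - cosine_similarity(feat_a[row_a][col_a], feat_b[row_b][col_b])
    return total / len(matches)


# Toy 2x2 feature maps with 2-dimensional features (made-up values).
feat_a = [[[1.0, 0.0], [0.0, 1.0]],
          [[1.0, 1.0], [0.5, 0.5]]]
feat_b = [[[0.0, 1.0], [1.0, 0.0]],
          [[1.0, 1.0], [2.0, 2.0]]]

# Matches whose features agree across the two views give a loss close to 0.
consistent = [((0, 0), (0, 1)), ((1, 0), (1, 0))]
loss_consistent = sparse_correspondence_loss(feat_a, feat_b, consistent)

# A match whose features are orthogonal gives a loss of 1.0.
inconsistent = [((0, 0), (0, 0))]
loss_inconsistent = sparse_correspondence_loss(feat_a, feat_b, inconsistent)
```

In a training setting, a term of this kind would typically be computed on differentiable tensors and added to the main denoising objective with a weighting factor. How the paper combines the two terms is not stated in the abstract.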