VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator

📅 2025-10-15
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the dual challenges of high-fidelity geometric reconstruction and cross-modal alignment in text-to-3D generation. Methodologically, it introduces VIST3A, a framework that stitches a pretrained text-to-video generator to a feed-forward multi-view 3D reconstruction network, a step that requires only a small unlabeled dataset and preserves the knowledge encoded in both sets of pretrained weights. To enforce cross-modal consistency, it adapts direct reward finetuning, optimizing jointly for 3D structural coherence and visual quality. Crucially, VIST3A operates without large-scale 3D annotations and supports plug-and-play integration of diverse text-to-video generators and reconstruction backbones. Experiments demonstrate that VIST3A consistently outperforms state-of-the-art Gaussian-splatting-based text-to-3D methods across multiple architecture combinations, and with a suitable 3D base model it also produces high-quality pointmaps. This establishes a new paradigm for efficient, reliable, text-driven 3D scene generation.

πŸ“ Abstract
The rapid progress of large, pretrained models for both visual content generation and 3D reconstruction opens up new possibilities for text-to-3D generation. Intuitively, one could obtain a formidable 3D scene generator if one were able to combine the power of a modern latent text-to-video model as "generator" with the geometric abilities of a recent (feedforward) 3D reconstruction system as "decoder". We introduce VIST3A, a general framework that does just that, addressing two main challenges. First, the two components must be joined in a way that preserves the rich knowledge encoded in their weights. We revisit model stitching, i.e., we identify the layer in the 3D decoder that best matches the latent representation produced by the text-to-video generator and stitch the two parts together. That operation requires only a small dataset and no labels. Second, the text-to-video generator must be aligned with the stitched 3D decoder, to ensure that the generated latents are decodable into consistent, perceptually convincing 3D scene geometry. To that end, we adapt direct reward finetuning, a popular technique for human preference alignment. We evaluate the proposed VIST3A approach with different video generators and 3D reconstruction models. All tested pairings markedly improve over prior text-to-3D models that output Gaussian splats. Moreover, by choosing a suitable 3D base model, VIST3A also enables high-quality text-to-pointmap generation.
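The stitching step described above, identifying the layer in the 3D decoder whose representation best matches the video generator's latents and joining the two there, can be sketched as a simple least-squares layer search. The sketch below is illustrative, not the paper's implementation: all names (`fit_stitch_layer`, the dict of candidate activations) are assumptions, and it assumes a small paired set of generator latents and decoder activations collected on the same images, with no labels needed.

```python
import torch
import torch.nn as nn

def fit_stitch_layer(gen_latents, decoder_activations):
    """Illustrative model-stitching search (not the paper's exact code).

    For each candidate decoder layer, fit a linear map from the video
    generator's latents to that layer's activations in closed form, and
    pick the layer with the lowest residual error, i.e. the layer whose
    representation the latents can best reproduce.
    """
    best_name, best_err, best_map = None, float("inf"), None
    for name, acts in decoder_activations.items():
        X = gen_latents.flatten(1)   # (N, d_gen)
        Y = acts.flatten(1)          # (N, d_dec)
        # Closed-form least squares: W = argmin_W ||X W - Y||^2
        W = torch.linalg.lstsq(X, Y).solution
        err = (X @ W - Y).pow(2).mean().item()
        if err < best_err:
            best_name, best_err, best_map = name, err, W
    # Wrap the best linear map as the "stitch" joining the two models.
    stitch = nn.Linear(best_map.shape[0], best_map.shape[1], bias=False)
    stitch.weight.data = best_map.T
    return best_name, stitch
```

In this toy form the stitch is a single linear layer fitted in closed form; the key idea from the abstract survives: the matching layer is chosen by how well the generator's latents predict it, using only a small unlabeled dataset.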
Problem

Research questions and friction points this paper is trying to address.

Combining text-to-video generation with 3D reconstruction for text-to-3D synthesis
Preserving encoded knowledge when stitching different model components together
Aligning generated video latents with 3D decoder for consistent geometry output
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stitches video generator to 3D reconstruction network
Uses model stitching to preserve pretrained knowledge
Aligns components via direct reward finetuning
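The alignment idea in the last bullet, direct reward finetuning, can be sketched as follows. This is a minimal toy sketch under stated assumptions, not the paper's training loop: it assumes a differentiable reward on the decoded 3D output (standing in for the paper's 3D-consistency and visual-quality terms), a frozen stitched decoder, and illustrative names (`reward_finetune_step`, `reward_fn`) throughout.

```python
import torch

def reward_finetune_step(generator, decoder, reward_fn, prompt_emb, optimizer):
    """One illustrative direct-reward-finetuning step.

    The reward is differentiable, so its gradient flows through the frozen
    stitched 3D decoder back into the generator's weights, pushing the
    generator to emit latents that decode into higher-reward 3D output.
    """
    latents = generator(prompt_emb)      # generator produces multi-view latents
    scene = decoder(latents)             # frozen stitched 3D decoder
    loss = -reward_fn(scene).mean()      # maximize reward = minimize its negative
    optimizer.zero_grad()
    loss.backward()                      # gradient reaches only the generator
    optimizer.step()
    return loss.item()
```

The design point this illustrates is that only the generator's optimizer takes a step; the decoder stays frozen and merely transmits the reward gradient, which is what lets the method align the two modules without 3D supervision.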