VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for driving video generation struggle to achieve fine-grained, object-level control while maintaining spatiotemporal consistency in long videos. To address this, the paper proposes a closed-loop generation framework that enables precise manipulation of specific entities, conditioned on 3D objects, images, and text, through multiview visual-language reasoning. The framework incorporates a Multiview Vision-Language Evaluator (MV-VLM) and an object-level refinement module, forming an iterative generate-evaluate-regenerate refinement loop. This approach substantially improves controllability and spatiotemporal coherence in long driving videos across diverse scenarios, particularly those involving long-tail objects, and significantly outperforms current state-of-the-art methods.
📝 Abstract
Driving video generation has made substantial progress in controllability, resolution, and length, but existing methods fail to support fine-grained, object-level control over diverse driving videos while preserving spatiotemporal consistency, especially in long video generation. In this paper, we present a new driving video generation technique, called VistaGEN, which enables fine-grained control of specific entities, including 3D objects, images, and text descriptions, while maintaining spatiotemporal consistency in long video sequences. Our key innovation is the incorporation of multiview visual-language reasoning into long driving video generation. To this end, we inject visual-language features into a multiview video generator to enable fine-grained controllability. More importantly, we propose a multiview vision-language evaluator (MV-VLM) that intelligently and automatically evaluates the spatiotemporal consistency of the generated content, forming a novel generation-evaluation-regeneration closed-loop generation mechanism. This mechanism ensures high-quality, coherent outputs, facilitating the creation of complex and reliable driving scenarios. Within the closed loop, we further introduce an object-level refinement module that refines unsatisfactory results flagged by the MV-VLM and feeds them back to the video generator for regeneration. Extensive evaluation shows that VistaGEN achieves diverse driving video generation with fine-grained controllability, especially for long-tail objects, and substantially better spatiotemporal consistency than previous approaches.
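The generation-evaluation-regeneration mechanism described in the abstract can be sketched as a simple control loop. The sketch below is purely illustrative: the paper does not publish code, so every name here (`generate`, `mv_vlm_evaluate`, `object_level_refine`, the `Clip` type, and the scalar consistency scores) is a hypothetical stand-in for the real multiview generator, MV-VLM evaluator, and object-level refinement module.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Clip:
    """Stand-in for a generated multiview video clip (hypothetical)."""
    frames: list
    consistency: float  # mock spatiotemporal-consistency score in [0, 1]

def generate(prompt: str, refinement_hint: Optional[float] = None) -> Clip:
    """Mock multiview video generator; a refinement hint raises the score."""
    score = 0.6 if refinement_hint is None else min(1.0, 0.6 + 0.2 * refinement_hint)
    return Clip(frames=[prompt], consistency=score)

def mv_vlm_evaluate(clip: Clip, threshold: float = 0.8) -> bool:
    """Mock MV-VLM: accept the clip if its consistency clears the threshold."""
    return clip.consistency >= threshold

def object_level_refine(clip: Clip) -> float:
    """Mock object-level refinement: emit a correction signal for regeneration."""
    return 1.0  # strongest correction strength in this toy model

def closed_loop_generate(prompt: str, max_rounds: int = 5) -> Clip:
    """generate -> evaluate -> (refine -> regenerate) until accepted."""
    hint = None
    clip = generate(prompt)
    for _ in range(max_rounds):
        clip = generate(prompt, refinement_hint=hint)
        if mv_vlm_evaluate(clip):
            return clip  # evaluator accepts: exit the loop
        hint = object_level_refine(clip)  # feed refinement back to the generator
    return clip  # best effort after max_rounds
```

In this toy model the first pass scores 0.6 and is rejected, the refinement hint lifts the regenerated clip to 0.8, and the evaluator then accepts it; the real system would replace the scalar score with the MV-VLM's multiview consistency judgment.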
Problem

Research questions and friction points this paper is trying to address.

driving video generation
fine-grained control
spatiotemporal consistency
object-level controllability
long video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multiview visual-language reasoning
fine-grained control
spatiotemporal consistency
closed-loop generation
driving video generation