OpenViGA: Video Generation for Automotive Driving Scenes by Streamlining and Fine-Tuning Open Source Models with Public Data

📅 2025-09-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current driving video generation methods face three key bottlenecks: reliance on computationally intensive large models, poor architectural interpretability, and absence of open-source implementations. This paper introduces the first fully open-source video generation system tailored for autonomous driving, built upon the BDD100K dataset. It modularly integrates and fine-tunes publicly available pre-trained components—namely, an image tokenizer, a world model, and a video decoder. By decoupling and evaluating these three core modules, the work delivers reproducible design insights. The end-to-end pipeline exclusively employs open models and data, enabling efficient training and inference on academic-grade GPUs. At 256×256 resolution and 4 fps, the system achieves high-fidelity, single-frame-latency video generation. This advances efficiency, transparency, and reproducibility, establishing a new benchmark for autonomous driving simulation and world modeling research.
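The modular design described above (image tokenizer → world model → video decoder, with one frame of algorithmic latency) can be sketched as a minimal pipeline. This is an illustrative assumption of the interfaces, not the paper's actual code; all class and method names below are hypothetical stand-ins.

```python
# Hypothetical sketch of OpenViGA's three-stage pipeline. The component
# roles (tokenizer, world model, decoder) are from the paper; all
# interfaces and the toy logic below are illustrative assumptions.

class ImageTokenizer:
    """Maps an RGB frame to a sequence of discrete tokens."""
    def encode(self, frame):
        # stand-in: derive a fake token id from the frame content
        return [hash(frame) % 1024]

class WorldModel:
    """Autoregressively predicts the next frame's tokens from past tokens."""
    def predict_next(self, token_history):
        # stand-in dynamics: shift the most recent frame's tokens
        return [(t + 1) % 1024 for t in token_history[-1]]

class VideoDecoder:
    """Reconstructs an RGB frame from a frame's worth of tokens."""
    def decode(self, tokens):
        return f"frame_from_{tokens}"

def generate(conditioning_frames, n_future):
    tok, wm, dec = ImageTokenizer(), WorldModel(), VideoDecoder()
    history = [tok.encode(f) for f in conditioning_frames]
    out = []
    for _ in range(n_future):
        nxt = wm.predict_next(history)  # one frame of algorithmic latency:
        history.append(nxt)             # each new frame depends only on
        out.append(dec.decode(nxt))     # tokens already produced
    return out
```

The key design point this mirrors is that the three modules communicate only through token sequences, so each can be evaluated (and fine-tuned) in isolation, as the paper does.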

📝 Abstract
Recent successful video generation systems that predict and create realistic automotive driving scenes from short video inputs assign tokenization, future state prediction (world model), and video decoding to dedicated models. These approaches often utilize large models that require significant training resources, offer limited insight into design choices, and lack publicly available code and datasets. In this work, we address these deficiencies and present OpenViGA, an open video generation system for automotive driving scenes. Our contributions are as follows: First, unlike several earlier works for video generation, such as GAIA-1, we provide a deep analysis of the three components of our system by separate quantitative and qualitative evaluation: image tokenizer, world model, and video decoder. Second, we build purely upon powerful pre-trained open-source models from various domains, which we fine-tune on publicly available automotive data (BDD100K) using GPU hardware at academic scale. Third, we build a coherent video generation system by streamlining the interfaces of our components. Fourth, owing to the public availability of the underlying models and data, we enable full reproducibility. Finally, we also publish our code and models on GitHub. For an image size of 256x256 at 4 fps, we are able to predict realistic driving scene videos frame by frame with only one frame of algorithmic latency.
Problem

Research questions and friction points this paper is trying to address.

Streamlining open source models for video generation
Fine-tuning with public data for automotive scenes
Reducing resource requirements and ensuring reproducibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning pre-trained open-source models
Streamlining component interfaces for coherence
Using publicly available data for reproducibility
👥 Authors
Björn Möller
Zhengyang Li
Malte Stelzer (Technische Universität Braunschweig, Institute for Communications Technology)
Thomas Graave (Technische Universität Braunschweig, Institute for Communications Technology)
Fabian Bettels (Technische Universität Braunschweig, Institute for Communications Technology)
Muaaz Ataya (Technische Universität Braunschweig, Institute for Communications Technology)
Tim Fingscheidt (Professor, IEEE Fellow, ITG Fellow, Technische Universität Braunschweig, Germany)
Research topics: Speech Enhancement, Acoustic Signal Processing, Speech Processing, Environment Perception, NLP