LSA: Localized Semantic Alignment for Enhancing Temporal Consistency in Traffic Video Generation

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited scalability of existing controllable video generation methods, which rely on external control signals during inference to maintain temporal consistency of dynamic objects. To overcome this constraint, the authors propose the Localized Semantic Alignment (LSA) framework, which enhances temporal consistency in pretrained video generation models through a single epoch of fine-tuning, without requiring any control signals at inference time. LSA introduces a localized semantic alignment loss that, combined with the standard diffusion loss, leverages off-the-shelf semantic feature extractors to align the semantics of local regions around dynamic objects between real and generated videos. Experiments on nuScenes and KITTI demonstrate that LSA outperforms baseline approaches, achieving improved temporal consistency as validated by enhanced mAP and mIoU metrics, all while incurring no additional inference overhead.

📝 Abstract
Controllable video generation has emerged as a versatile tool for autonomous driving, enabling realistic synthesis of traffic scenarios. However, existing methods depend on control signals at inference time to guide the generative model towards temporally consistent generation of dynamic objects, limiting their utility as scalable and generalizable data engines. In this work, we propose Localized Semantic Alignment (LSA), a simple yet effective framework for fine-tuning pre-trained video generation models. LSA enhances temporal consistency by aligning semantic features between ground-truth and generated video clips. Specifically, we compare the output of an off-the-shelf feature extraction model between the ground-truth and generated video clips, localized around dynamic objects, inducing a semantic feature consistency loss. We fine-tune the base model by combining this loss with the standard diffusion loss. The model fine-tuned for a single epoch with our novel loss outperforms the baselines on common video generation evaluation metrics. To further test the temporal consistency of generated videos, we adapt two additional metrics from the object detection task, namely mAP and mIoU. Extensive experiments on the nuScenes and KITTI datasets show the effectiveness of our approach in enhancing temporal consistency in video generation, without the need for external control signals during inference or any additional computational overhead.
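The alignment objective described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the crop helper, the stand-in feature extractor, the cosine-distance form of the alignment term, and the weighting factor `lam` are all assumptions chosen so the example runs without external models.

```python
import numpy as np

def crop_region(frame, box):
    """Crop a dynamic-object region from a frame given a box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return frame[y0:y1, x0:x1]

def extract_features(patch):
    """Stand-in for an off-the-shelf semantic feature extractor (the paper
    uses a frozen pretrained model); here a flattened descriptor so the
    sketch is self-contained."""
    return patch.reshape(-1).astype(np.float64)

def cosine_distance(a, b, eps=1e-8):
    """1 - cosine similarity between two feature vectors."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def lsa_loss(real_frames, gen_frames, boxes_per_frame):
    """Localized semantic alignment loss: average feature distance between
    matching dynamic-object crops in real vs. generated frames."""
    dists = []
    for real, gen, boxes in zip(real_frames, gen_frames, boxes_per_frame):
        for box in boxes:
            f_real = extract_features(crop_region(real, box))
            f_gen = extract_features(crop_region(gen, box))
            dists.append(cosine_distance(f_real, f_gen))
    return float(np.mean(dists)) if dists else 0.0

def total_loss(diffusion_loss, real_frames, gen_frames, boxes, lam=0.1):
    """Combined fine-tuning objective: standard diffusion loss plus a
    weighted LSA term (the weight lam is a hypothetical choice)."""
    return diffusion_loss + lam * lsa_loss(real_frames, gen_frames, boxes)
```

When the generated crops match the real crops exactly, the alignment term vanishes and only the diffusion loss remains; mismatched object regions increase the combined objective, which is the signal the fine-tuning step exploits.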
Problem

Research questions and friction points this paper is trying to address.

temporal consistency
video generation
controllable generation
autonomous driving
dynamic objects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Localized Semantic Alignment
Temporal Consistency
Video Generation
Diffusion Models
Semantic Feature Alignment
Mirlan Karimov
Mercedes-Benz AG, Germany; ETH Zurich, Switzerland
Teodora Spasojevic
Mercedes-Benz AG, Germany; Friedrich-Alexander University Erlangen-Nuremberg, Germany
Markus Braun
Mercedes-Benz AG
Machine Learning, Computer Vision, Autonomous Driving, Pedestrian Detection
Julian Wiederer
Mercedes-Benz AG, Germany; Friedrich-Alexander University Erlangen-Nuremberg, Germany
Vasileios Belagiannis
Professor, Friedrich-Alexander-Universität Erlangen-Nürnberg
Machine Learning, Computer Vision, Robotics
Marc Pollefeys
Professor of Computer Science, ETH Zurich, and Director Spatial AI Lab, Microsoft
Computer Vision, Computer Graphics, Robotics, Machine Learning, Augmented Reality