Seeing Clearly, Forgetting Deeply: Revisiting Fine-Tuned Video Generators for Driving Simulation

📅 2025-08-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Fine-tuning video generation models on structured driving datasets often improves visual fidelity at the expense of spatial accuracy of dynamic elements—revealing a decoupling between visual representation learning and physical motion modeling. Method: We propose a continual learning framework based on cross-domain replay, which periodically re-introduces multi-domain driving scene data during fine-tuning to mitigate dynamic information forgetting. The approach requires no architectural modifications and introduces only a lightweight replay mechanism. Contribution/Results: Our method preserves high visual quality while significantly improving modeling accuracy of dynamic spatial attributes—including object trajectories and relative poses. Experiments across multiple driving video generation benchmarks demonstrate that it effectively balances visual realism and physical consistency. This validates simple, architecture-agnostic replay as a practical and effective strategy for addressing representational decoupling in spatiotemporal video generation.

Technology Category

Application Category

📝 Abstract
Recent advancements in video generation have substantially improved visual quality and temporal coherence, making these models increasingly appealing for applications such as autonomous driving, particularly in the context of driving simulation and so-called "world models". In this work, we investigate the effects of existing fine-tuning video generation approaches on structured driving datasets and uncover a potential trade-off: although visual fidelity improves, spatial accuracy in modeling dynamic elements may degrade. We attribute this degradation to a shift in the alignment between visual quality and dynamic understanding objectives. In datasets with diverse scene structures within temporal space, where objects or perspective shift in varied ways, these objectives tend to highly correlated. However, the very regular and repetitive nature of driving scenes allows visual quality to improve by modeling dominant scene motion patterns, without necessarily preserving fine-grained dynamic behavior. As a result, fine-tuning encourages the model to prioritize surface-level realism over dynamic accuracy. To further examine this phenomenon, we show that simple continual learning strategies, such as replay from diverse domains, can offer a balanced alternative by preserving spatial accuracy while maintaining strong visual quality.
Problem

Research questions and friction points this paper is trying to address.

Investigating fine-tuning effects on video generation spatial accuracy
Addressing trade-off between visual fidelity and dynamic modeling degradation
Exploring continual learning for balanced visual and spatial performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continual learning strategies balance trade-offs
Replay from diverse domains preserves accuracy
Maintains visual quality and spatial precision