AI Summary
Existing Sim2Real video generation methods for autonomous driving struggle to achieve control consistency and visual realism simultaneously. This work proposes a novel generative framework that, for the first time, incorporates features from the vision foundation model DINOv3 into this task. By combining Principal Subspace Projection with Random Channel Tail Drop, the method preserves semantic structure while enhancing fine-grained visual fidelity. A learnable spatial alignment module and a causal temporal aggregator further improve temporal coherence and motion clarity. The resulting approach achieves state-of-the-art performance in autonomous driving Sim2Real translation, markedly advancing structural fidelity, visual realism, and temporal stability, all while maintaining strict consistency with input control signals.
Abstract
Driven by the emergence of Controllable Video Diffusion, existing Sim2Real methods for autonomous driving video generation typically rely on explicit intermediate representations to bridge the domain gap. However, these modalities face a fundamental Consistency-Realism Dilemma. Low-level signals (e.g., edges, blurred images) ensure precise control but compromise realism by "baking in" synthetic artifacts, whereas high-level priors (e.g., depth, semantics, HD maps) facilitate photorealism but lack the structural detail required for consistent guidance. In this work, we present Driving with DINO (DwD), a novel framework that leverages Vision Foundation Model (VFM) features as a unified bridge between the simulation and real-world domains. We first identify that these features encode a spectrum of information, from high-level semantics to fine-grained structure. To exploit this spectrum, we employ Principal Subspace Projection to discard the high-frequency components responsible for "texture baking," while concurrently introducing Random Channel Tail Drop to mitigate the structural loss inherent in a rigid dimensionality cutoff, thereby reconciling realism with control consistency. Furthermore, to fully leverage DINOv3's high-resolution capabilities for enhancing control precision, we introduce a learnable Spatial Alignment Module that adapts these high-resolution features to the diffusion backbone. Finally, we propose a Causal Temporal Aggregator that employs causal convolutions to explicitly preserve historical motion context when integrating frame-wise DINO features, which effectively mitigates motion blur and ensures temporal stability. Project page: https://albertchen98.github.io/DwD-project/
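To make the projection step concrete, below is a minimal PyTorch sketch of how Principal Subspace Projection with Random Channel Tail Drop could operate on DINOv3 patch tokens. The PCA basis, the tensor shapes, and all names (`PrincipalSubspaceProjection`, `min_keep`, `max_keep`, `eval_keep`) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class PrincipalSubspaceProjection(nn.Module):
    """Project features onto a PCA subspace with a randomized channel cutoff."""

    def __init__(self, basis: torch.Tensor, mean: torch.Tensor,
                 min_keep: int = 64, max_keep: int = 256, eval_keep: int = 128):
        super().__init__()
        # basis: (C, C) PCA eigenvectors, columns sorted by descending
        # eigenvalue, precomputed offline over a DINOv3 feature corpus; mean: (C,).
        self.register_buffer("basis", basis[:, :max_keep])
        self.register_buffer("mean", mean)
        self.min_keep, self.max_keep, self.eval_keep = min_keep, max_keep, eval_keep

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, C) patch tokens. Center and project onto the subspace.
        coeffs = (feats - self.mean) @ self.basis          # (B, N, max_keep)
        if self.training:
            # Random Channel Tail Drop: sample the cutoff rather than fixing
            # it, so structure carried by mid-spectrum channels is not always
            # lost to one rigid dimensionality-reduction boundary.
            k = int(torch.randint(self.min_keep, self.max_keep + 1, (1,)))
        else:
            k = self.eval_keep
        coeffs[..., k:] = 0.0                              # drop the channel tail
        # Reconstruct in feature space; high-frequency "texture" components
        # beyond the retained subspace are discarded.
        return coeffs @ self.basis.T + self.mean
```

At inference time the cutoff is fixed, so the control signal fed to the diffusion model stays deterministic while training has seen a range of cutoffs.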
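The spatial alignment step can be sketched similarly. One plausible reading of the abstract is a small convolutional adapter that resamples the high-resolution DINOv3 feature map onto the diffusion backbone's latent grid; the layer choices, the zero-initialization, and the name `SpatialAlignmentModule` are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAlignmentModule(nn.Module):
    """Resample high-resolution DINOv3 features onto the diffusion latent grid."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.GroupNorm(8, out_ch),   # out_ch assumed divisible by 8
            nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )
        # Zero-init the last conv so the new conditioning branch starts as a
        # no-op, a common trick when injecting control signals into a
        # pretrained diffusion backbone.
        nn.init.zeros_(self.proj[-1].weight)
        nn.init.zeros_(self.proj[-1].bias)

    def forward(self, dino_feats: torch.Tensor, latent_hw: tuple) -> torch.Tensor:
        # dino_feats: (B, C, H, W) high-resolution feature map from DINOv3.
        x = F.interpolate(dino_feats, size=latent_hw, mode="bilinear",
                          align_corners=False)
        return self.proj(x)  # (B, out_ch, h, w), matched to the latent grid
```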
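Finally, the causal temporal aggregation can be illustrated with a left-padded temporal convolution: each frame's features are mixed only with features from the same and earlier frames, so historical motion context is preserved and no future information leaks in. The shapes, the residual formulation, and the name `CausalTemporalAggregator` are again assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalTemporalAggregator(nn.Module):
    """Aggregate frame-wise features along time with a causal 1D convolution."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.pad = kernel_size - 1          # pad only on the past side
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C, H, W) frame-wise DINO features.
        b, t, c, h, w = feats.shape
        # Fold space into the batch so the 1D convolution runs along time only.
        x = feats.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        x = F.pad(x, (self.pad, 0))         # causal: pad the left (past) side
        x = self.conv(x)                    # (B*H*W, C, T), no future leakage
        x = x.reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)
        return feats + x                    # residual temporal aggregation
```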