Driving with DINO: Vision Foundation Features as a Unified Bridge for Sim-to-Real Generation in Autonomous Driving

📅 2026-02-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing Sim2Real video generation methods for autonomous driving struggle to achieve control consistency and visual realism simultaneously. This work proposes a novel generative framework that, for the first time, incorporates features from the vision foundation model DINOv3 into this task. By combining Principal Subspace Projection with Random Channel Tail Drop, the method preserves semantic structure while enhancing fine-grained visual fidelity. A learnable spatial alignment module and a causal temporal aggregator further improve temporal coherence and motion clarity. The resulting approach achieves state-of-the-art performance in autonomous driving Sim2Real translation, advancing structural fidelity, visual realism, and temporal stability while maintaining strict consistency with the input control signals.
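The two feature-conditioning ideas named in the summary can be sketched as follows. This is a minimal reading of the idea, not the authors' implementation: the function names, the feature shapes, and the choice to zero out (rather than slice away) the dropped tail channels are all assumptions.

```python
import numpy as np

def principal_subspace_projection(feats, k):
    """Project patch features onto their top-k principal components,
    discarding the high-frequency tail blamed for 'texture baking'.
    feats: (N, C) array of per-patch DINO features (hypothetical shape)."""
    mu = feats.mean(axis=0, keepdims=True)
    centered = feats - mu
    # Columns of Vt.T are the principal directions of the feature cloud.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:k].T  # (N, k) low-frequency semantic coordinates

def random_channel_tail_drop(proj, k_min, rng):
    """During training, randomly truncate trailing PCA channels so the model
    cannot over-rely on one rigid cutoff (our reading of 'tail drop')."""
    k = rng.integers(k_min, proj.shape[1] + 1)  # random keep-count in [k_min, k_max]
    out = proj.copy()
    out[:, k:] = 0.0  # zero the tail instead of slicing, keeping a fixed shape
    return out
```

The random cutoff means the leading channels always survive, so the coarse semantic structure is stable while the amount of fine-grained detail varies across training samples.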

๐Ÿ“ Abstract
Driven by the emergence of Controllable Video Diffusion, existing Sim2Real methods for autonomous driving video generation typically rely on explicit intermediate representations to bridge the domain gap. However, these modalities face a fundamental Consistency-Realism Dilemma. Low-level signals (e.g., edges, blurred images) ensure precise control but compromise realism by "baking in" synthetic artifacts, whereas high-level priors (e.g., depth, semantics, HDMaps) facilitate photorealism but lack the structural detail required for consistent guidance. In this work, we present Driving with DINO (DwD), a novel framework that leverages Vision Foundation Model (VFM) features as a unified bridge between the simulation and real-world domains. We first identify that these features encode a spectrum of information, from high-level semantics to fine-grained structure. To effectively utilize this, we employ Principal Subspace Projection to discard the high-frequency elements responsible for "texture baking," while concurrently introducing Random Channel Tail Drop to mitigate the structural loss inherent in rigid dimensionality reduction, thereby reconciling realism with control consistency. Furthermore, to fully leverage DINOv3's high-resolution capabilities for enhancing control precision, we introduce a learnable Spatial Alignment Module that adapts these high-resolution features to the diffusion backbone. Finally, we propose a Causal Temporal Aggregator employing causal convolutions to explicitly preserve historical motion context when integrating frame-wise DINO features, which effectively mitigates motion blur and guarantees temporal stability. Project page: https://albertchen98.github.io/DwD-project/
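The causal convolution behind the Causal Temporal Aggregator can be illustrated with a left-padded 1-D convolution over the time axis, so that each frame only sees its own and earlier features. A minimal sketch, assuming depthwise weights and (T, C) per-frame features; the real aggregator's architecture is not specified in this summary.

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution over time: the output at frame t depends only
    on frames <= t, preserving historical motion context.
    x: (T, C) per-frame DINO features, w: (K, C) depthwise kernel
    (illustrative shapes, not the paper's)."""
    K = w.shape[0]
    pad = np.zeros((K - 1, x.shape[1]))      # left-pad with K-1 zero frames
    xp = np.concatenate([pad, x], axis=0)
    # The window ending at frame t covers original frames t-K+1 .. t only.
    return np.stack([(xp[t:t + K] * w).sum(axis=0) for t in range(x.shape[0])])
```

Because the padding is entirely on the left, information from a frame can never leak backward in time, which is the property that lets the aggregator stabilize motion without peeking at future frames.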
Problem

Research questions and friction points this paper is trying to address.

Sim2Real
Consistency-Realism Dilemma
Autonomous Driving
Video Generation
Domain Gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Foundation Model
Sim-to-Real
DINOv3 Features
Temporal Consistency
Diffusion Models
Authors

Xuyang Chen (Technical University of Munich)
Conglang Zhang (Huawei Riemann Lab)
Chuanheng Fu (Huawei Riemann Lab)
Zihao Yang (New York University)
Kaixuan Zhou (Huawei Riemann Lab)
Yizhi Zhang (Huawei Riemann Lab)
Jianan He (Huawei Riemann Lab)
Yanfeng Zhang (Northeastern University, China)
Mingwei Sun (Huawei Riemann Lab)
Zhen Dong (Wuhan University)
Xiaoxiao Long (Nanjing University)
Zengmao Wang (School of Computer Science, Wuhan University)
Liqiu Meng (Technical University of Munich)