DualDiff+: Dual-Branch Diffusion for High-Fidelity Video Generation with Reward Guidance

📅 2025-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing driving scene reconstruction methods rely on 3D bounding boxes and BEV maps, which limits their capacity to model complex geometry and multimodal semantics and results in low video fidelity. This paper proposes a dual-branch conditional diffusion model tailored for autonomous driving. It introduces Occupancy Ray-shape Sampling as a structured conditional input that explicitly encodes spatial occupancy geometry along ray trajectories. To enhance fine-grained control and cross-modal alignment, the authors propose a foreground-aware mask loss and a semantic fusion attention mechanism, and they design a reward-guided diffusion framework that explicitly optimizes for multi-view consistency and global coherence. Evaluated on nuScenes, the method achieves a 4.09% reduction in FID, improves BEV vehicle and road segmentation mIoU by 4.50% and 1.70%, respectively, and boosts foreground 3D detection mAP by 1.46%.
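The Occupancy Ray-shape Sampling idea, encoding occupancy geometry along camera ray trajectories, can be pictured with a toy sketch. Everything here (the grid layout, nearest-cell lookup, and step size) is an illustrative assumption, not the paper's actual ORS formulation:

```python
import numpy as np

def ray_occupancy_samples(occ_grid, origin, direction, n_samples=8, step=1.0):
    """Sample a binary occupancy grid at points marching along one camera ray.

    occ_grid: (X, Y, Z) binary array; origin/direction: 3-vectors in grid units.
    Returns occupancy values at n_samples points from origin along direction.
    Illustrative of ray-based occupancy encoding, not the paper's exact ORS.
    """
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    origin = np.asarray(origin, dtype=float)
    samples = []
    for i in range(n_samples):
        p = origin + i * step * direction
        idx = np.round(p).astype(int)  # nearest-cell lookup (an assumption)
        if np.all(idx >= 0) and np.all(idx < occ_grid.shape):
            samples.append(int(occ_grid[tuple(idx)]))
        else:
            samples.append(0)  # outside the grid counts as free space
    return samples
```

Stacking such per-ray occupancy vectors over all camera rays yields a structured, geometry-aware conditional input rather than a coarse box or map.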

📝 Abstract
Accurate and high-fidelity driving scene reconstruction demands the effective utilization of comprehensive scene information as conditional inputs. Existing methods predominantly rely on 3D bounding boxes and BEV road maps for foreground and background control, which fail to capture the full complexity of driving scenes and adequately integrate multimodal information. In this work, we present DualDiff, a dual-branch conditional diffusion model designed to enhance driving scene generation across multiple views and video sequences. Specifically, we introduce Occupancy Ray-shape Sampling (ORS) as a conditional input, offering rich foreground and background semantics alongside 3D spatial geometry to precisely control the generation of both elements. To improve the synthesis of fine-grained foreground objects, particularly complex and distant ones, we propose a Foreground-Aware Mask (FGM) denoising loss function. Additionally, we develop the Semantic Fusion Attention (SFA) mechanism to dynamically prioritize relevant information and suppress noise, enabling more effective multimodal fusion. Finally, to ensure high-quality image-to-video generation, we introduce the Reward-Guided Diffusion (RGD) framework, which maintains global consistency and semantic coherence in generated videos. Extensive experiments demonstrate that DualDiff achieves state-of-the-art (SOTA) performance across multiple datasets. On the NuScenes dataset, DualDiff reduces the FID score by 4.09% compared to the best baseline. In downstream tasks, such as BEV segmentation, our method improves vehicle mIoU by 4.50% and road mIoU by 1.70%, while in BEV 3D object detection, the foreground mAP increases by 1.46%. Code will be made available at https://github.com/yangzhaojason/DualDiff.
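The Foreground-Aware Mask denoising loss described in the abstract amounts to reweighting the standard denoising objective so foreground pixels contribute more. A minimal numpy sketch, where the linear weighting scheme and the `fg_weight` parameter are illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np

def fgm_denoising_loss(pred_noise, true_noise, fg_mask, fg_weight=2.0):
    """Mean-squared denoising loss with extra weight on foreground pixels.

    pred_noise, true_noise: (H, W) arrays of predicted / ground-truth noise.
    fg_mask: (H, W) binary array, 1 where a foreground object projects.
    fg_weight: hypothetical up-weighting factor for foreground regions.
    """
    weights = 1.0 + (fg_weight - 1.0) * fg_mask  # 1 on background, fg_weight on foreground
    sq_err = (pred_noise - true_noise) ** 2
    return float(np.sum(weights * sq_err) / np.sum(weights))
```

With `fg_weight > 1`, an error on a foreground pixel costs more than the same error on a background pixel, which is how the loss pushes the model toward sharper small and distant objects.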
Problem

Research questions and friction points this paper is trying to address.

Existing methods condition only on 3D bounding boxes and BEV road maps, failing to capture complex scene geometry and multimodal semantics.
Fine-grained foreground objects, especially complex and distant ones, are synthesized poorly.
Image-to-video generation struggles to maintain global consistency and semantic coherence across frames.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-branch conditional diffusion model for driving scene generation
Occupancy Ray-shape Sampling (ORS) for precise foreground and background control
Foreground-Aware Mask (FGM) denoising loss for fine-grained, distant objects
Semantic Fusion Attention (SFA) for noise-suppressed multimodal fusion
Reward-Guided Diffusion (RGD) for globally consistent video generation
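One common realization of reward guidance is to score candidate samples with a reward model and steer sampling toward higher-reward ones. The best-of-n selection and the temporal-coherence reward below are toy assumptions for illustration, not the paper's actual RGD update:

```python
import numpy as np

def reward_guided_step(candidates, reward_fn):
    """Pick the candidate next-frame sample with the highest reward.

    candidates: list of (H, W) arrays, alternative denoised samples for one step.
    reward_fn: scores a sample, e.g. a multi-view / temporal-consistency reward.
    Best-of-n selection is only one simple form of reward guidance,
    not the paper's exact RGD framework.
    """
    scores = [reward_fn(c) for c in candidates]
    return candidates[int(np.argmax(scores))]
```

A usage sketch: with a reward that penalizes deviation from the previous frame, the step favors temporally coherent continuations, which is the kind of global-consistency signal RGD is said to optimize.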
Authors

Zhao Yang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University

Zezhong Qian
Xi'an Jiaotong University
World Model, Autonomous Driving, Video Generation, Robot Manipulation

Xiaofan Li
East China Normal University
Computer Vision

Weixiang Xu
Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

Gongpeng Zhao
University of Science and Technology of China, Anhui 230052, China

Ruohong Yu
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University

Lingsi Zhu
University of Science and Technology of China, Anhui 230052, China

Longjun Liu
Xi'an Jiaotong University
Computer Architecture, VLSI, Deep Learning, DNN Accelerator