Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the coarse temporal modeling and insufficient cue exploitation in vision-based 3D semantic occupancy prediction (VisionOcc). We propose a unified temporal fusion paradigm grounded in gradient-descent-based feature updating. For the first time, we systematically identify and model three critical temporal cues: scene consistency, motion calibration, and geometric complementarity. Furthermore, we reinterpret recurrent neural network (RNN) architectures as iterative gradient descent steps in feature space, enabling end-to-end differentiable fusion of heterogeneous temporal representations. Evaluated on the nuScenes Occ3D benchmark, our method achieves consistent improvements—increasing mIoU by 1.4–4.8 percentage points—while reducing memory consumption by 27–72% over state-of-the-art approaches. The framework demonstrates superior efficiency and accuracy, establishing a new paradigm for temporal fusion in VisionOcc.
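At its core, the reinterpretation treats each temporal update of a feature as one step of gradient descent on an energy that couples the history feature with the current observation. The sketch below is a minimal illustration of that reading with a hand-picked quadratic energy; the symbols E, \eta, h_t, and x_t are notational assumptions for exposition, not the paper's exact formulation.

```latex
% Vanilla RNN update (pre-activation form):
%   h_t = W_h h_{t-1} + W_x x_t + b
% Gradient-descent reading: h_t is one descent step, with step size \eta,
% on an energy E measuring disagreement between history and observation.
\[
  h_t = h_{t-1} - \eta \,\nabla_{h} E\big(h_{t-1}, x_t\big),
  \qquad
  E(h, x) = \tfrac{1}{2}\,\lVert W_x x - W_h h \rVert_2^2 .
\]
% With this quadratic choice, -\nabla_h E = W_h^{\top}(W_x x - W_h h), so
%   h_t = (I - \eta W_h^{\top} W_h)\, h_{t-1} + \eta\, W_h^{\top} W_x\, x_t,
% i.e. an affine map of (h_{t-1}, x_t), matching the linear part of a vanilla RNN.
```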

📝 Abstract
We present GDFusion, a temporal fusion method for vision-based 3D semantic occupancy prediction (VisionOcc). GDFusion opens up the underexplored aspects of temporal fusion within the VisionOcc framework, focusing on both temporal cues and fusion strategies. It systematically examines the entire VisionOcc pipeline, identifying three fundamental yet previously overlooked temporal cues: scene-level consistency, motion calibration, and geometric complementation. These cues capture diverse facets of temporal evolution and make distinct contributions across various modules in the VisionOcc framework. To effectively fuse temporal signals across heterogeneous representations, we propose a novel fusion strategy by reinterpreting the formulation of vanilla RNNs. This reinterpretation leverages gradient descent on features to unify the integration of diverse temporal information, seamlessly embedding the proposed temporal cues into the network. Extensive experiments on nuScenes demonstrate that GDFusion significantly outperforms established baselines. Notably, on the Occ3D benchmark, it achieves 1.4%-4.8% mIoU improvements and reduces memory consumption by 27%-72%.
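To make the fusion strategy concrete, here is a minimal PyTorch-style sketch of one gradient-descent fusion step on voxel features, assuming a simple quadratic energy that pulls the accumulated history feature toward a projection of the current frame. The class name TemporalGDFusion, the 1x1x1 projection P, and the learnable step size eta are illustrative assumptions, not the released GDFusion implementation.

```python
import torch
import torch.nn as nn

class TemporalGDFusion(nn.Module):
    """Illustrative single gradient-descent fusion step on voxel features.

    The history feature h is updated by one descent step on the quadratic
    energy E(h, x) = 0.5 * ||P(x) - h||^2, i.e. it is pulled toward a
    projection of the current frame's feature. The energy choice and the
    learnable step size are assumptions made for exposition.
    """

    def __init__(self, channels: int, eta: float = 0.1):
        super().__init__()
        # 1x1x1 conv projecting the current observation into the history space.
        self.P = nn.Conv3d(channels, channels, kernel_size=1)
        self.eta = nn.Parameter(torch.tensor(eta))  # learnable step size

    def forward(self, h_prev: torch.Tensor, x_curr: torch.Tensor) -> torch.Tensor:
        # grad_h E = h - P(x); one descent step: h - eta * grad = h + eta * (P(x) - h)
        return h_prev + self.eta * (self.P(x_curr) - h_prev)

# Usage: fuse an accumulated history voxel grid with the current frame.
if __name__ == "__main__":
    B, C, X, Y, Z = 1, 32, 16, 16, 4        # batch, channels, voxel grid dims
    h_prev = torch.randn(B, C, X, Y, Z)      # history feature
    x_curr = torch.randn(B, C, X, Y, Z)      # current-frame voxel feature
    fused = TemporalGDFusion(C)(h_prev, x_curr)
    print(fused.shape)                       # torch.Size([1, 32, 16, 16, 4])
```

Because the update is a differentiable function of both inputs, stacking such steps over time can be trained end to end, which is the property the paper relies on to fuse heterogeneous temporal cues.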
Problem

Research questions and friction points this paper is trying to address.

Temporal fusion in VisionOcc remains coarse, leaving many temporal cues unexploited
Which temporal cues matter for vision-based occupancy prediction, and where in the pipeline do they act?
How can heterogeneous temporal signals be fused within a single, unified update rule?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified gradient-descent view of temporal fusion, obtained by reinterpreting vanilla RNN updates
Leverages scene-level consistency, motion calibration, and geometric complementation cues
Improves mIoU by 1.4%-4.8% while reducing memory use by 27%-72%
Authors

Dubing Chen, University of Macau (Computer Vision, Machine Learning)
Huan Zheng, SKL-IOTSC, CIS, University of Macau
Jin Fang, SKL-IOTSC, CIS, University of Macau
Xingping Dong, Wuhan University
Xianfei Li, COWAROBOT Co. Ltd.
Wenlong Liao, COWAROBOT Co. Ltd. (Robotics, AI)
Tao He, COWAROBOT Co. Ltd.
Pai Peng, COWAROBOT Co. Ltd.
Jianbing Shen, Professor, University of Macau (Computer Vision, Medical Image Analysis, Vision and Language, Self-Driving Cars, AI in Healthcare)