Temporal Triplane Transformers as Occupancy World Models

📅 2025-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing world models struggle to capture the fine-grained spatiotemporal coupling between agent motion and environmental dynamics, and they often fall short of real-time inference. This paper introduces T³Former, a 4D occupancy world model tailored for autonomous driving. It is the first to integrate a tri-plane implicit representation with a temporal Transformer to construct a compact and efficient 4D environmental representation. It proposes a history–change decoupled autoregressive prediction paradigm that jointly decodes occupancy forecasts and ego-vehicle trajectories via multi-scale temporal feature extraction and differentiable iterative tri-plane updates. The model is optimized end to end. On nuScenes, it runs at 26 FPS (a 1.44× speedup), reaches a mean IoU of 36.09, and reduces the mean absolute planning error to 1.0 m, improving modeling accuracy, dynamic granularity, and inference efficiency.

📝 Abstract
Recent years have seen significant advances in world models, which primarily focus on learning fine-grained correlations between an agent's motion trajectory and the resulting changes in its surrounding environment. However, existing methods often struggle to capture such fine-grained correlations and to achieve real-time predictions. To address this, we propose a new 4D occupancy world model for autonomous driving, termed T$^3$Former. T$^3$Former begins by pre-training a compact triplane representation that efficiently compresses the 3D semantic occupancy of the environment. Next, T$^3$Former extracts multi-scale temporal motion features from the historical triplanes and employs an autoregressive approach to iteratively predict the next triplane changes. Finally, T$^3$Former combines the predicted triplane changes with the previous triplanes and decodes the result into future occupancy and ego-motion trajectories. Experimental results demonstrate the superiority of T$^3$Former, achieving 1.44$\times$ faster inference speed (26 FPS), while improving the mean IoU to 36.09 and reducing the mean absolute planning error to 1.0 meters.
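The autoregressive step described in the abstract (predict a triplane change, then combine it with the previous triplane) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `(3, C, H, W)` array layout, the additive way of combining a change with the previous frame, the function names `rollout` and `mean_velocity_predictor`, and the toy predictor that stands in for the learned temporal Transformer. It is not the authors' implementation.

```python
import numpy as np

def rollout(history, predict_change, steps):
    """Autoregressively roll out future triplane features.

    history: list of arrays, each shaped (3, C, H, W) (hypothetical layout)
    predict_change: callable mapping the current history to a change array
    steps: number of future frames to generate
    """
    frames = list(history)
    futures = []
    for _ in range(steps):
        delta = predict_change(frames)   # predicted change for the next step
        nxt = frames[-1] + delta         # assumed: change is added to the previous triplane
        futures.append(nxt)
        frames = frames[1:] + [nxt]      # slide the history window forward
    return futures

def mean_velocity_predictor(frames):
    """Toy stand-in for the learned model: mean frame-to-frame difference."""
    diffs = [b - a for a, b in zip(frames[:-1], frames[1:])]
    return np.mean(diffs, axis=0)

# Three past frames whose values grow by 1.0 per step.
history = [np.full((3, 2, 4, 4), float(t)) for t in range(3)]
futures = rollout(history, mean_velocity_predictor, steps=2)
```

Predicting a change rather than the full next frame means a static scene only requires a near-zero output, which is one plausible reading of why the abstract separates history from change.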
Problem

Research questions and friction points this paper is trying to address.

Captures fine-grained correlations in agent-environment interactions
Achieves real-time predictions for autonomous driving scenarios
Improves inference speed and accuracy in occupancy world models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compact triplane representation for 3D environment compression
Multi-scale temporal motion feature extraction
Autoregressive prediction of future occupancy and motion
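To give a rough sense of why a triplane is compact, the sketch below collapses a dense 3D feature volume into three axis-aligned planes and reads a point feature back by summing the three plane entries. This is an assumption-laden illustration: the paper pre-trains a learned triplane representation, whereas this sketch swaps in plain mean pooling, and the grid size, channel count, and the names `to_triplane` and `query` are hypothetical.

```python
import numpy as np

def to_triplane(volume):
    """Collapse a (C, X, Y, Z) feature volume into three planes by mean pooling."""
    xy = volume.mean(axis=3)  # (C, X, Y), pooled over Z
    xz = volume.mean(axis=2)  # (C, X, Z), pooled over Y
    yz = volume.mean(axis=1)  # (C, Y, Z), pooled over X
    return xy, xz, yz

def query(planes, x, y, z):
    """Read a point feature by summing its projections on the three planes."""
    xy, xz, yz = planes
    return xy[:, x, y] + xz[:, x, z] + yz[:, y, z]

# Hypothetical sizes: 8 channels on a 16 x 16 x 4 voxel grid.
volume = np.random.default_rng(0).random((8, 16, 16, 4))
planes = to_triplane(volume)
feat = query(planes, 3, 5, 2)

dense_size = volume.size                      # C * X * Y * Z values
triplane_size = sum(p.size for p in planes)   # C * (X*Y + X*Z + Y*Z) values
```

Storage grows with the sum of pairwise grid products instead of the full product of all three dimensions, so the saving widens as the grid gets larger.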
Haoran Xu
School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China; Peng Cheng Laboratory, Shenzhen 518108, China
Peixi Peng
Peng Cheng Laboratory, Shenzhen 518108, China; School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, 518066, China
Guang Tan
School of Intelligent Systems Engineering, Sun Yat-sen University
Machine Learning, Mobile Computing, Networking
Yiqian Chang
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, 518055, China; Peng Cheng Laboratory, Shenzhen 518108, China
Yisen Zhao
School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, 518066, China
Yonghong Tian
Peng Cheng Laboratory, Shenzhen 518108, China; School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, 518066, China; School of Computer Science, Peking University, Beijing 100871, China