Temporal Triplane Transformers as Occupancy World Models

📅 2025-03-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing world models struggle to capture the fine-grained spatiotemporal coupling between agent motion and environmental dynamics, and they often fall short of real-time inference. This paper introduces T³Former, a 4D occupancy world model for autonomous driving. It is described as the first to integrate a triplane implicit representation with a temporal Transformer to build a compact, efficient 4D environmental representation. The model follows a history–change decoupled autoregressive prediction paradigm: it extracts multi-scale temporal features, iteratively updates the triplane in a differentiable way, and jointly decodes occupancy forecasts and ego-vehicle trajectories. The model is optimized end to end. On nuScenes, it reportedly runs at 26 FPS (a 1.44× speedup), reaches a mean IoU of 36.09, and reduces the mean absolute planning error to 1.0 m.
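
To make the IoU figure concrete, here is a minimal sketch of a class-averaged IoU over flattened voxel labels. This is one common reading of the metric. The summary does not state whether the reported number is averaged over classes, over prediction horizons, or both, so the averaging choice and the function names `per_class_iou` and `mean_iou` are assumptions for illustration.

```python
from typing import Dict, Sequence


def per_class_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> Dict[int, float]:
    """IoU per semantic class over flattened voxel labels.

    Classes absent from both prediction and target are skipped
    (an assumption; evaluation protocols differ on this point).
    """
    ious: Dict[int, float] = {}
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious[c] = inter / union
    return ious


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average the per-class IoU values; returns 0.0 if no class is present."""
    ious = per_class_iou(pred, target, num_classes)
    return sum(ious.values()) / len(ious) if ious else 0.0
```

A benchmark-style score would typically multiply this value by 100 and average it over several future time steps.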

📝 Abstract

Recent years have seen significant advances in world models, which primarily focus on learning fine-grained correlations between an agent's motion trajectory and the resulting changes in its surrounding environment. However, existing methods often struggle to capture such fine-grained correlations and achieve real-time predictions. To address this, we propose a new 4D occupancy world model for autonomous driving, termed T$^3$Former. T$^3$Former begins by pre-training a compact triplane representation that efficiently compresses the 3D semantically occupied environment. Next, T$^3$Former extracts multi-scale temporal motion features from the historical triplane and employs an autoregressive approach to iteratively predict the next triplane changes. Finally, T$^3$Former combines the triplane changes with the previous ones to decode them into future occupancy results and ego-motion trajectories. Experimental results demonstrate the superiority of T$^3$Former, achieving 1.44$\times$ faster inference speed (26 FPS), while improving the mean IoU to 36.09 and reducing the mean absolute planning error to 1.0 meters.
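
The autoregressive loop described in the abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the additive combination of the previous triplane with the predicted change is an assumption, `predict_change` is a hypothetical stand-in for the learned temporal model, and each plane holds a single scalar channel for brevity.

```python
from typing import Callable, List

Plane = List[List[float]]   # one 2D feature plane (H x W), single channel for brevity
Triplane = List[Plane]      # three planes; the XY, XZ, YZ ordering is an assumption


def add_triplanes(a: Triplane, b: Triplane) -> Triplane:
    """Element-wise sum of two triplanes with identical shapes."""
    return [
        [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(pa, pb)]
        for pa, pb in zip(a, b)
    ]


def rollout(
    history: List[Triplane],
    predict_change: Callable[[List[Triplane]], Triplane],
    steps: int,
) -> List[Triplane]:
    """Autoregressively predict `steps` future triplanes.

    Each step predicts a *change* from the frames seen so far, adds it to
    the most recent triplane (assumed additive), and feeds the result back
    into the history for the next step.
    """
    frames = list(history)
    futures: List[Triplane] = []
    for _ in range(steps):
        delta = predict_change(frames)
        nxt = add_triplanes(frames[-1], delta)
        frames.append(nxt)
        futures.append(nxt)
    return futures


def last_difference(frames: List[Triplane]) -> Triplane:
    """Toy predictor: repeat the most recent frame-to-frame change."""
    prev, cur = frames[-2], frames[-1]
    return [
        [[c - p for p, c in zip(rp, rc)] for rp, rc in zip(pp, pc)]
        for pp, pc in zip(prev, cur)
    ]
```

In a full model, each predicted triplane would then pass through decoders for occupancy and ego-motion. Those decoders are omitted here because the abstract does not describe them in detail.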

Problem

Research questions and friction points this paper is trying to address.

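Existing world models struggle to capture fine-grained correlations between an agent's motion and the resulting changes in its environment

Existing methods often fail to reach real-time prediction speed in autonomous driving scenarios

Occupancy world models need both faster inference and higher accuracy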

Innovation

Methods, ideas, or system contributions that make the work stand out.

Compact triplane representation for 3D environment compression

Multi-scale temporal motion feature extraction

Autoregressive prediction of future occupancy and motion
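
To illustrate why a triplane is a compact stand-in for a dense 3D feature volume, the sketch below looks up a voxel's feature by projecting it onto three axis-aligned planes and summing the results. Summation is a common aggregation choice in the triplane literature, but it is an assumption here, as are the function names and the grid size used in the usage note.

```python
from typing import Dict, List

# A plane is indexed [a][b] and stores a feature vector of length C per cell.
Plane = List[List[List[float]]]


def query_triplane(xy: Plane, xz: Plane, yz: Plane, x: int, y: int, z: int) -> List[float]:
    """Feature for voxel (x, y, z): sum of its three plane projections."""
    f_xy = xy[x][y]
    f_xz = xz[x][z]
    f_yz = yz[y][z]
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


def storage_cost(x: int, y: int, z: int, channels: int) -> Dict[str, int]:
    """Number of stored feature values: dense volume vs. three planes."""
    return {
        "dense": x * y * z * channels,
        "triplane": (x * y + x * z + y * z) * channels,
    }
```

For an illustrative 200 × 200 × 16 grid with 8 channels (not a figure taken from the paper), `storage_cost(200, 200, 16, 8)` gives 5,120,000 values for the dense volume against 371,200 for the three planes.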

🔎 Similar Papers

A Spatiotemporal Approach to Tri-Perspective Representation for 3D Semantic Occupancy Prediction