🤖 AI Summary
Existing world models struggle to capture the fine-grained spatiotemporal coupling between agent motion and environmental dynamics, and they fall short of real-time inference. This paper introduces T³Former, a 4D occupancy world model for autonomous driving. It is the first to integrate a tri-plane implicit representation with a temporal Transformer to build a compact and efficient 4D environmental representation. The authors propose a history–change decoupled autoregressive prediction paradigm that jointly decodes occupancy forecasts and ego-vehicle trajectories through multi-scale temporal feature extraction and differentiable iterative tri-plane updates. The model is optimized end to end. On nuScenes, it runs at 26 FPS (a 1.44× speedup), reaches a mean IoU of 36.09, and reduces the mean absolute planning error to 1.0 m, improving modeling accuracy, dynamic granularity, and inference efficiency.
📝 Abstract
Recent years have seen significant advances in world models, which primarily focus on learning fine-grained correlations between an agent's motion trajectory and the resulting changes in its surrounding environment. However, existing methods often struggle to capture such correlations and to achieve real-time prediction. To address this, we propose a new 4D occupancy world model for autonomous driving, termed T$^3$Former. T$^3$Former first pre-trains a compact triplane representation that efficiently compresses the 3D semantic occupancy environment. Next, it extracts multi-scale temporal motion features from the historical triplanes and autoregressively predicts the next triplane change. Finally, it combines each predicted triplane change with the previous triplane and decodes the result into future occupancy and ego-motion trajectories. Experimental results demonstrate the superiority of T$^3$Former, which achieves 1.44$\times$ faster inference (26 FPS) while improving the mean IoU to 36.09 and reducing the mean absolute planning error to 1.0 meters.
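The autoregressive step described in the abstract (predict a triplane change from the history, add it to the previous triplane, then feed the result back in) can be illustrated with a toy sketch. Everything named here is an assumption for illustration and not the authors' implementation: the `Triplane` layout, the helper functions, and especially `last_step_change`, a hand-written stand-in for the learned temporal predictor that simply repeats the most recent change. The occupancy and trajectory decoders are omitted.

```python
from typing import Callable, Dict, List

# Hypothetical layout: three 2D feature planes keyed "xy", "xz", "yz".
Plane = List[List[float]]
Triplane = Dict[str, Plane]


def add_planes(a: Triplane, b: Triplane) -> Triplane:
    """Element-wise sum of two triplanes with identical shapes."""
    return {k: [[x + y for x, y in zip(ra, rb)]
                for ra, rb in zip(a[k], b[k])] for k in a}


def sub_planes(a: Triplane, b: Triplane) -> Triplane:
    """Element-wise difference a - b of two triplanes."""
    return {k: [[x - y for x, y in zip(ra, rb)]
                for ra, rb in zip(a[k], b[k])] for k in a}


def last_step_change(history: List[Triplane]) -> Triplane:
    """Toy stand-in for the learned predictor: repeat the latest change."""
    return sub_planes(history[-1], history[-2])


def rollout(history: List[Triplane],
            predict_change: Callable[[List[Triplane]], Triplane],
            steps: int) -> List[Triplane]:
    """Iteratively predict a change and add it to the previous triplane."""
    history = list(history)  # copy so the caller's list is not mutated
    futures: List[Triplane] = []
    for _ in range(steps):
        delta = predict_change(history)         # predicted triplane change
        nxt = add_planes(history[-1], delta)    # previous triplane + change
        futures.append(nxt)
        history.append(nxt)                     # feed the prediction back in
    return futures


def constant_triplane(value: float) -> Triplane:
    """Build a tiny 1x2 triplane filled with one value (demo only)."""
    return {k: [[value, value]] for k in ("xy", "xz", "yz")}


past = [constant_triplane(0.0), constant_triplane(1.0)]
futures = rollout(past, last_step_change, steps=3)
```

Predicting a change rather than the full next triplane keeps the static part of the scene in the carried-over state, so the predictor only has to model what moves; in the real model each future triplane would then be passed to decoders for occupancy and ego trajectory.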