OccTENS: 3D Occupancy World Model via Temporal Next-Scale Prediction

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing autoregressive 3D occupancy world models suffer from inefficient inference, temporal degradation in long-horizon prediction, and poor controllability. This paper proposes OccTENS, a controllable and efficient generative 3D occupancy world model. The approach addresses these limitations through three key innovations: (1) a spatiotemporally decoupled, scale-progressive generation framework that explicitly separates geometric scene modeling from the evolution of motion dynamics; (2) a lightweight TensFormer architecture that incorporates global pose aggregation to explicitly encode spatiotemporal causality; and (3) unified modeling of occupancy voxels and ego-vehicle motion sequences, enabling fine-grained conditional control via driving actions. Evaluated on nuScenes and other benchmarks, the method achieves significant improvements in long-term (≥3 s) occupancy prediction accuracy and temporal stability while accelerating inference by 2.1×. It also preserves high-fidelity geometric reconstruction and precise motion controllability.

📝 Abstract
In this paper, we propose OccTENS, a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency. Different from visual generation, the occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of the 3D scenes, posing great challenges for the generative models. Recent approaches based on autoregression (AR) have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from inefficiency, temporal degradation in long-term generation, and lack of controllability. To holistically address these issues, we reformulate the occupancy world model as a temporal next-scale prediction (TENS) task, which decomposes the temporal sequence modeling problem into the modeling of spatial scale-by-scale generation and temporal scene-by-scene prediction. With a TensFormer, OccTENS can effectively manage the temporal causality and spatial relationships of occupancy sequences in a flexible and scalable way. To enhance the pose controllability, we further propose a holistic pose aggregation strategy, which features a unified sequence modeling for occupancy and ego-motion. Experiments show that OccTENS outperforms the state-of-the-art method with both higher occupancy quality and faster inference time.
Problem

Research questions and friction points this paper is trying to address.

Generating high-fidelity long-term 3D occupancy scenes efficiently
Overcoming temporal degradation in autoregressive occupancy prediction models
Enhancing controllability of ego-motion in occupancy sequence generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal next-scale prediction task reformulation
TensFormer architecture for spatiotemporal modeling
Holistic pose aggregation strategy for controllability
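
The temporal next-scale prediction reformulation listed above can be illustrated with a small sketch. The code below is not the paper's implementation: it assumes each scene is tokenized into a pyramid of square 2D token maps (one per scale), and that a token may attend to every token in earlier scenes plus tokens at the same or coarser scales within its own scene. The names `tens_block_layout`, `tens_attention_mask`, `tens_step_counts`, and the parameter `scale_sizes` are hypothetical, introduced only for illustration.

```python
import numpy as np


def tens_block_layout(num_frames, scale_sizes):
    """Per-token (frame, scale) indices for a flattened scene-by-scene,
    scale-by-scale sequence. Assumes one side x side token map per scale."""
    frames, scales = [], []
    for t in range(num_frames):                 # temporal scene-by-scene order
        for k, side in enumerate(scale_sizes):  # spatial coarse-to-fine order
            n = side * side
            frames.extend([t] * n)
            scales.extend([k] * n)
    return np.array(frames), np.array(scales)


def tens_attention_mask(num_frames, scale_sizes):
    """Boolean mask where mask[i, j] is True if query token i may attend to
    key token j. The masking rule is an assumption, not the paper's design."""
    frames, scales = tens_block_layout(num_frames, scale_sizes)
    q_frame, k_frame = frames[:, None], frames[None, :]
    q_scale, k_scale = scales[:, None], scales[None, :]
    earlier_frame = k_frame < q_frame        # all tokens of past scenes
    same_frame = k_frame == q_frame
    coarser_or_equal = k_scale <= q_scale    # same or coarser scale, same scene
    return earlier_frame | (same_frame & coarser_or_equal)


def tens_step_counts(num_frames, scale_sizes):
    """Sequential decoding steps when a whole scale is emitted per step,
    compared with emitting one token per step."""
    tokens_per_frame = sum(side * side for side in scale_sizes)
    scale_steps = num_frames * len(scale_sizes)
    token_steps = num_frames * tokens_per_frame
    return scale_steps, token_steps


if __name__ == "__main__":
    mask = tens_attention_mask(num_frames=2, scale_sizes=(1, 2))
    print(mask.astype(int))
    print(tens_step_counts(num_frames=2, scale_sizes=(1, 2, 4)))
```

Under these assumptions, two scenes with token maps of side 1, 2, and 4 need 6 sequential scale-level steps instead of 42 token-level steps, which is one plausible reading of where the reported inference speedup comes from.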