Occupancy World Model for Robots

📅 2025-05-07
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
This work addresses the lack of 3D occupancy evolution forecasting for robots in indoor scenes, a setting left unexplored by existing occupancy world models, which concentrate on structured outdoor road scenes. It introduces RoboOccWorld, an occupancy world model built on a guided autoregressive transformer with a combined spatio-temporal receptive field. Two components are proposed: Conditional Causal State Attention (CCSA), which conditions the autoregressive transformer on the camera pose of the next state, and Hybrid Spatio-Temporal Aggregation (HSTA), which aggregates cues from historical observations over multi-scale spatio-temporal windows. The authors also restructure the OccWorld-ScanNet benchmark from local annotations to support indoor evaluation. In the reported experiments, RoboOccWorld outperforms state-of-the-art methods on the indoor 3D occupancy scene evolution prediction task.

📝 Abstract
Understanding and forecasting scene evolution deeply affects the exploration and decision-making of embodied agents. While traditional methods simulate scene evolution through trajectory prediction of potential instances, recent works use the occupancy world model as a generative framework for describing fine-grained dynamics of the whole scene. However, existing methods focus on structured outdoor road scenes and neglect forecasting 3D occupancy scene evolution for robots in indoor scenes. In this work, we explore a new framework for learning scene evolution from observed fine-grained occupancy and propose RoboOccWorld, an occupancy world model that forecasts scene evolution using a combined spatio-temporal receptive field and a guided autoregressive transformer. We propose Conditional Causal State Attention (CCSA), which uses the camera pose of the next state as a condition to guide the autoregressive transformer in adapting to and understanding indoor robotics scenarios. To effectively exploit the spatio-temporal cues in historical observations, we propose Hybrid Spatio-Temporal Aggregation (HSTA), which obtains the combined spatio-temporal receptive field from multi-scale spatio-temporal windows. In addition, we restructure the OccWorld-ScanNet benchmark based on local annotations to facilitate evaluation of the indoor 3D occupancy scene evolution prediction task. Experimental results demonstrate that RoboOccWorld outperforms state-of-the-art methods on the indoor 3D occupancy scene evolution prediction task. The code will be released soon.
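To make the idea behind CCSA more concrete, here is a minimal sketch of causal attention whose query is conditioned on the pose of the next state. It is not the authors' implementation: the function name, the additive way the pose embedding is mixed into the query, the single attention head, and the absence of learned projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pose_conditioned_causal_attention(states, next_poses):
    """Hypothetical single-head sketch (assumption, not the paper's code).

    states:     list of T feature vectors, one per observed frame.
    next_poses: list of T pose embeddings; next_poses[t] describes the
                camera pose of frame t + 1 and has the same length as
                a state vector.
    Returns T vectors; output t attends only to frames 0..t.
    """
    dim = len(states[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for t in range(len(states)):
        # Assumed conditioning: add the next-state pose to the query.
        query = [s + p for s, p in zip(states[t], next_poses[t])]
        # Causal mask: frames after t are never visible to step t.
        visible = states[: t + 1]
        weights = softmax([dot(query, key) * scale for key in visible])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, visible)) for d in range(dim)]
        )
    return outputs
```

In this sketch, changing a later frame leaves earlier outputs untouched (causality), while changing `next_poses[t]` shifts where step `t` attends, which is the role the abstract assigns to the pose condition.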
Problem

Research questions and friction points this paper is trying to address.

Forecasting 3D occupancy scene evolutions for indoor robots
Learning fine-grained scene dynamics using occupancy world models
Improving indoor robotics decision-making via spatio-temporal scene prediction
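The forecasting task listed above is framed in the abstract as autoregressive and guided by camera poses. A minimal sketch of such a rollout loop follows, assuming a hypothetical `predict_next` callable and a toy one-dimensional occupancy row; `shift_model`, which treats the pose as an integer cell shift, is only a stand-in for a learned model.

```python
def rollout(history, future_poses, predict_next):
    """Autoregressive forecasting loop (illustrative sketch).

    history:       list of past occupancy grids.
    future_poses:  one camera pose per future step to forecast.
    predict_next:  hypothetical callable (frames, pose) -> next grid.
    """
    frames = list(history)
    forecasts = []
    for pose in future_poses:
        next_grid = predict_next(frames, pose)
        forecasts.append(next_grid)
        # Feed the prediction back so later steps build on it.
        frames.append(next_grid)
    return forecasts


def shift_model(frames, pose):
    """Toy stand-in: shift the latest 1D occupancy row by `pose` cells."""
    last = frames[-1]
    size = len(last)
    return [last[i - pose] if 0 <= i - pose < size else 0 for i in range(size)]


# One occupied cell moves one step per forecast:
# rollout([[1, 0, 0, 0]], [1, 1], shift_model) -> [[0, 1, 0, 0], [0, 0, 1, 0]]
```

The second forecast depends on the first, which is what distinguishes an autoregressive rollout from predicting every future step directly from the observed history.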
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional Causal State Attention for indoor scenarios
Hybrid Spatio-Temporal Aggregation for multi-scale cues
RoboOccWorld framework for 3D occupancy forecasting
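As a rough illustration of the multi-scale windows behind HSTA, the sketch below pools a history of occupancy features over several (temporal span, spatial radius) windows and averages the scales. It is a deliberate simplification under stated assumptions: space is one-dimensional, pooling is a plain mean rather than attention, and the names `window_mean` and `hybrid_aggregate` are hypothetical.

```python
def window_mean(history, t, x, t_span, x_radius):
    """Mean of history[t'][x'] over the last `t_span` frames ending at t
    and over cells within `x_radius` of x (clipped at the borders)."""
    t_start = max(0, t - t_span + 1)
    x_start = max(0, x - x_radius)
    x_end = min(len(history[0]) - 1, x + x_radius)
    values = [
        history[tt][xx]
        for tt in range(t_start, t + 1)
        for xx in range(x_start, x_end + 1)
    ]
    return sum(values) / len(values)


def hybrid_aggregate(history, windows):
    """Combine several window scales for every cell of the latest frame.

    history: list of T rows, each a list of per-cell feature values.
    windows: list of (t_span, x_radius) pairs, one per scale (assumed).
    """
    t = len(history) - 1
    width = len(history[0])
    return [
        sum(window_mean(history, t, x, ts, xr) for ts, xr in windows) / len(windows)
        for x in range(width)
    ]
```

With a single window `(1, 0)` the function returns the latest frame unchanged; adding wider windows such as `(2, 1)` blends in context from neighbouring cells and earlier frames.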
Authors

Zhang Zhang
Beijing Innovation Center of Humanoid Robotics, Beijing Institute of Technology
Qiang Zhang
Beijing Innovation Center of Humanoid Robotics, Hong Kong University of Science and Technology (Guangzhou)
Wei Cui
Beijing Innovation Center of Humanoid Robotics
Shuai Shi
Beijing Innovation Center of Humanoid Robotics
Yijie Guo
Beijing Innovation Center of Humanoid Robotics
Gang Han
Professor of Biostatistics, Texas A&M University
Statistics, Biostatistics, Medical research, Computer experiments
Wen Zhao
JSPS International Fellow, UT-Austin Postdoc, KAUST
MEMS, Sensor, Nonlinear Dynamics
Jingkai Sun
Beijing Innovation Center of Humanoid Robotics, Hong Kong University of Science and Technology (Guangzhou)
Jiahang Cao
The University of Hong Kong
Robot Learning, Generative Models, Cognitive-inspired Models
Jiaxu Wang
Beijing Innovation Center of Humanoid Robotics, Hong Kong University of Science and Technology (Guangzhou)
Hao Cheng
Beijing Innovation Center of Humanoid Robotics, Hong Kong University of Science and Technology (Guangzhou)
Xiaozhu Ju
Beijing Innovation Center of Humanoid Robotics
Zhengping Che
X-Humanoid
Embodied AI, Deep Learning
Renjing Xu
HKUST(GZ)
Brain-inspired Computing, Humanoid Computing
Jian Tang
Beijing Innovation Center of Humanoid Robotics