Occupancy World Model for Robots

📅 2025-05-07

🤖 AI Summary

This work addresses the lack of 3D occupancy evolution forecasting for indoor scenes: existing occupancy world models concentrate on structured outdoor road scenes. It introduces RoboOccWorld, an occupancy world model for indoor embodied robots. Methodologically, it proposes Conditional Causal State Attention (CCSA), which conditions an autoregressive transformer on the camera pose of the next state, and Hybrid Spatio-Temporal Aggregation (HSTA), which builds a combined spatio-temporal receptive field from multi-scale spatio-temporal windows. The authors also restructure the OccWorld-ScanNet benchmark from local annotations to support indoor evaluation. In the reported experiments, RoboOccWorld outperforms state-of-the-art methods on indoor 3D occupancy scene evolution prediction.

📝 Abstract

Understanding and forecasting scene evolution strongly affects the exploration and decision-making of embodied agents. While traditional methods simulate scene evolution through trajectory prediction of potential instances, recent works use the occupancy world model as a generative framework for describing fine-grained overall scene dynamics. However, existing methods concentrate on structured outdoor road scenes and leave the forecasting of 3D occupancy scene evolution for robots in indoor scenes unexplored. In this work, we explore a new framework for learning scene evolution from observed fine-grained occupancy and propose RoboOccWorld, an occupancy world model that forecasts scene evolution using a combined spatio-temporal receptive field and a guided autoregressive transformer. We propose Conditional Causal State Attention (CCSA), which uses the camera pose of the next state as a condition to guide the autoregressive transformer in adapting to and understanding indoor robotics scenarios. To effectively exploit the spatio-temporal cues in historical observations, we propose Hybrid Spatio-Temporal Aggregation (HSTA), which obtains the combined spatio-temporal receptive field from multi-scale spatio-temporal windows. In addition, we restructure the OccWorld-ScanNet benchmark based on local annotations to facilitate evaluation of the indoor 3D occupancy scene evolution prediction task. Experimental results demonstrate that RoboOccWorld outperforms state-of-the-art methods on the indoor 3D occupancy scene evolution prediction task. The code will be released soon.
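
To make the idea of pose-guided causal attention concrete, the following is a minimal NumPy sketch and not the authors' implementation, which the abstract does not detail. It assumes one state token per time step, adds a projected next-step camera pose to the queries, and applies a causal mask so each step attends only to itself and earlier steps. The function name, the 7-dimensional pose (translation plus quaternion), and the projection `w_pose` are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pose_conditioned_causal_attention(states, next_poses, w_pose):
    """Hypothetical sketch of pose-conditioned causal attention.

    states:     (T, D) scene-state tokens, one per time step.
    next_poses: (T, P) camera pose of the step that follows each state.
    w_pose:     (P, D) assumed linear projection from pose to token space.
    """
    T, D = states.shape
    # Assumption: the condition enters by shifting the queries with the pose.
    queries = states + next_poses @ w_pose
    scores = queries @ states.T / np.sqrt(D)
    # Causal mask: step t may attend only to steps <= t.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ states, weights

rng = np.random.default_rng(0)
states = rng.normal(size=(4, 8))
next_poses = rng.normal(size=(4, 7))      # e.g. translation + quaternion
w_pose = rng.normal(size=(7, 8)) * 0.1
out, weights = pose_conditioned_causal_attention(states, next_poses, w_pose)
```

In an autoregressive rollout, the output at the last step would feed a decoder that predicts the next occupancy state, which is then appended to the history together with the following camera pose.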

Problem

Research questions and friction points this paper is trying to address.

Forecasting 3D occupancy scene evolutions for indoor robots

Learning fine-grained scene dynamics using occupancy world models

Improving indoor robotics decision-making via spatio-temporal scene prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional Causal State Attention for indoor scenarios

Hybrid Spatio-Temporal Aggregation for multi-scale cues

RoboOccWorld framework for 3D occupancy forecasting
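
As a rough illustration of aggregating over multi-scale spatio-temporal windows, the sketch below pools an occupancy feature history over several (temporal, spatial) window sizes and fuses the results. It is an assumption-laden toy, not the HSTA module itself: the helper names, the mean pooling, the non-overlapping spatial blocks, the equal-weight fusion, and the chosen scales are all hypothetical.

```python
import numpy as np

def window_mean(history, t_win, s_win):
    """Mean over the last t_win frames and over non-overlapping s_win^3
    spatial blocks, broadcast back to the original (X, Y, Z) grid.

    history: (T, X, Y, Z) array; X, Y, Z must be divisible by s_win.
    """
    _, X, Y, Z = history.shape
    recent = history[-t_win:].mean(axis=0)            # temporal window
    blocks = recent.reshape(X // s_win, s_win,
                            Y // s_win, s_win,
                            Z // s_win, s_win)
    pooled = blocks.mean(axis=(1, 3, 5))              # spatial window
    for axis in range(3):                             # upsample to (X, Y, Z)
        pooled = np.repeat(pooled, s_win, axis=axis)
    return pooled

def hybrid_aggregate(history, scales=((1, 1), (2, 2), (4, 4))):
    # Assumed fusion: plain average of the per-scale features.
    return np.mean([window_mean(history, t, s) for t, s in scales], axis=0)

rng = np.random.default_rng(0)
history = rng.random(size=(4, 8, 8, 4))   # 4 frames of an 8x8x4 voxel grid
aggregated = hybrid_aggregate(history)
```

Small windows preserve fine detail from the latest frame, while large windows summarize slower, coarser changes; a learned module would typically replace the fixed averaging with attention or convolution.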
