VEOcc: Voxel-Centric Online Semantic Occupancy Prediction For Embodied Scene Understanding

📅 2026-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing online 3D semantic occupancy prediction methods struggle to preserve structural boundaries and depend on predefined scene-size priors, which limits practical deployment. This work proposes VEOcc, a voxel-centric recursive perception-and-assimilation framework that replaces Gaussian-centric modeling and supports open-ended map expansion without initial scale estimation. Its online update strategy combines Cross-Temporal Logit Aggregation (TLA), Reliability-Aware Confidence Modulation (RCM), and Confidence-Driven Incremental State Update (CSU) to fuse noisy multi-frame observations. The method reports new state-of-the-art results on Occ-ScanNet and EmbodiedOcc-ScanNet in both local and embodied settings, and zero-shot evaluations on self-collected video sequences indicate generalization to previously unseen real-world environments.
📝 Abstract
Crucial for autonomous exploration, online 3D occupancy prediction and mapping incrementally constructs dense spatial representations on the fly. However, recent Gaussian-centric methods struggle with structural boundary fidelity and rely heavily on predefined scene-size priors, fundamentally limiting their operational efficiency. In this work, we present VEOcc, a voxel-centric framework formulated as a recursive perception-and-assimilation paradigm. By eliminating the need for initial scale estimation, VEOcc enables highly streamlined, open-ended map expansion. Furthermore, to robustly aggregate noisy temporal observations within the discrete voxel space, we propose a Spatio-Temporal-Aware Online Update Strategy. It integrates Cross-Temporal Logit Aggregation (TLA) for temporal consistency, Reliability-Aware Confidence Modulation (RCM) for spatial uncertainty calibration, and Confidence-Driven Incremental State Update (CSU) for robust global state assimilation. Extensive experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate that VEOcc establishes new state-of-the-art performance in both local and embodied settings. Notably, zero-shot evaluations on self-collected video sequences further confirm its robust out-of-distribution generalization capability in completely unseen real-world environments. Ultimately, our framework provides an accurate and highly efficient solution for autonomous exploration. Code and supplementary visualizations are available on our project page: https://wryzju.github.io/VEOcc/.
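The abstract does not give the update equations, so the following is only a minimal sketch of the general idea it describes: a voxel map that grows on demand (no scene-size prior) and fuses per-frame class logits with a confidence weight. It is a generic confidence-weighted running average, not the authors' TLA, RCM, or CSU modules. The class name `SparseVoxelMap`, the voxel size, the class count, and the per-observation confidence input are all assumptions made for illustration.

```python
import numpy as np


class SparseVoxelMap:
    """Hypothetical open-ended voxel map: cells are created on first observation."""

    def __init__(self, num_classes=4, voxel_size=0.08):
        # num_classes and voxel_size are illustrative values, not from the paper.
        self.num_classes = num_classes
        self.voxel_size = voxel_size
        self.logits = {}      # (i, j, k) -> fused class logits
        self.confidence = {}  # (i, j, k) -> accumulated confidence weight

    def key(self, point):
        # Quantize a 3D point (metres) to an integer voxel index.
        idx = np.floor(np.asarray(point, dtype=float) / self.voxel_size)
        return tuple(idx.astype(int).tolist())

    def update(self, points, frame_logits, frame_conf):
        # Fuse one frame: each observation carries class logits and a
        # scalar confidence (assumed to come from an upstream predictor).
        for p, z, c in zip(points, frame_logits, frame_conf):
            k = self.key(p)
            old_c = self.confidence.get(k, 0.0)
            old_z = self.logits.get(k, np.zeros(self.num_classes))
            new_c = old_c + c
            if new_c > 0:
                # Confidence-weighted average of stored and incoming logits.
                self.logits[k] = (old_c * old_z + c * np.asarray(z)) / new_c
                self.confidence[k] = new_c

    def label(self, point):
        # Return the fused class index, or None for unobserved space.
        k = self.key(point)
        if k not in self.logits:
            return None
        return int(np.argmax(self.logits[k]))
```

A possible usage: call `update` once per incoming frame with the observed points, their predicted logits, and their confidences, then query `label` for any point. Because storage is a dictionary keyed by voxel index, observing a point far outside the current extent simply adds a new entry, so no map bounds need to be estimated up front.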
Problem

Research questions and friction points this paper is trying to address.

online 3D occupancy prediction
structural boundary fidelity
scene-size priors
autonomous exploration
semantic occupancy mapping
Innovation

Methods, ideas, or system contributions that make the work stand out.

Voxel-Centric
Online Semantic Occupancy
Spatio-Temporal-Aware Update
Recursive Perception-and-Assimilation
Zero-Shot Generalization