QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
High-cost 3D annotation for large-scale scenes, together with the limitations of existing self-supervised methods (reliance on implicit geometric modeling or on discrete voxel representations), hinders scalable, high-fidelity 3D semantic occupancy learning from monocular images. To address this, we propose a self-supervised framework based on continuous 4D spatio-temporal queries that learns high-precision 3D semantic occupancy directly from single-view RGB inputs. Key contributions: (1) a contractive scene representation that preserves fine-grained near-field geometry while compactly representing distant regions, enabling long-range reasoning at constant memory cost; (2) cross-frame 4D query supervision, compatible with either pseudo-point clouds generated by vision foundation models or raw LiDAR data; and (3) an end-to-end continuous occupancy modeling paradigm. On the Occ3D-nuScenes benchmark, the method improves semantic RayIoU by 26% over state-of-the-art purely vision-based approaches while running at 11.6 FPS.

📝 Abstract
Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions. QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning. https://research.zenseact.com/publications/queryocc/
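The abstract's core idea, supervising a continuous occupancy field with independent 4D (x, y, z, t) queries rather than rendered 2D views or voxel grids, can be illustrated with a minimal sketch. This is not the paper's exact sampling scheme; the pseudo-point cloud, ray-based free-space sampling, and all array names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pseudo-point cloud from an adjacent frame: N surface
# points (x, y, z), e.g. lifted from a vision foundation model's depth
# predictions, each tagged with its capture time t.
surface_xyz = rng.uniform(-50.0, 50.0, size=(1024, 3))
timestamps = np.full((1024, 1), 0.1)  # 100 ms after the reference frame

# Positive queries: points on observed surfaces (occupied).
positives = np.concatenate([surface_xyz, timestamps], axis=-1)

# Negative queries: free-space points along the ray from the sensor
# origin to each surface point (space before the hit must be free).
free_frac = rng.uniform(0.05, 0.95, size=(1024, 1))
negatives = np.concatenate([surface_xyz * free_frac, timestamps], axis=-1)

# Each (x, y, z, t) query is supervised independently with a binary
# occupancy target; a semantic label could be attached the same way.
queries = np.concatenate([positives, negatives])           # (2048, 4)
targets = np.concatenate([np.ones(1024), np.zeros(1024)])  # occupancy
```

Because each query is an independent point sample, supervision needs no rendering pipeline and no voxelization of the accumulated point cloud.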
Problem

Research questions and friction points this paper is trying to address.

Learning 3D semantic occupancy from images without expensive manual annotations
Overcoming limitations of 2D rendering consistency and discretized voxel grids
Enabling long-range 3D reasoning with constant memory constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Query-based self-supervised 4D spatio-temporal occupancy learning
Contractive scene representation for memory-efficient long-range reasoning
Supervision from vision foundation models or raw LiDAR data
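The contractive scene representation above can be pictured with a standard scene-contraction mapping of the kind used in unbounded neural-field methods. The paper does not state its exact formula, so the function below is an illustrative assumption: points inside the unit ball are kept unchanged (preserving near-field detail), while distant points are smoothly squashed into a ball of radius 2, so arbitrarily long ranges fit in a fixed-size representation:

```python
import numpy as np

def contract(points: np.ndarray) -> np.ndarray:
    """Smoothly map unbounded 3D points into a ball of radius 2.

    Points with norm <= 1 pass through unchanged; a point at distance
    d > 1 is moved to distance 2 - 1/d along the same direction, so
    the whole scene occupies constant, bounded memory.
    """
    points = np.asarray(points, dtype=np.float64)
    norm = np.linalg.norm(points, axis=-1, keepdims=True)
    safe = np.maximum(norm, 1e-9)  # avoid division by zero at the origin
    return np.where(norm <= 1.0, points, (2.0 - 1.0 / safe) * points / safe)
```

For example, a point 4 units away maps to 1.75 units, and a point 1000 units away still lands just inside radius 2, while everything within 1 unit of the sensor keeps its exact coordinates.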