Enhancing Indoor Occupancy Prediction via Sparse Query-Based Multi-Level Consistent Knowledge Distillation

📅 2025-11-01
🏛️ IEEE Robotics and Automation Letters
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the efficiency-accuracy trade-off in indoor scene occupancy prediction: dense methods waste computation on empty voxels, while sparse query-based approaches lack robustness in complex scenes. The authors propose DiScene, a sparse query-based framework that combines multi-level consistent knowledge distillation (aligning teacher and student at the encoder, query, prior, and anchor levels) with a teacher-guided parameter initialization strategy. Together, these components improve the accuracy, robustness, and convergence speed of sparse query models. On Occ-ScanNet, DiScene runs at 23.2 FPS without depth priors and outperforms the OPUS baseline by 36.1%, surpassing even its depth-augmented variant. With depth input, DiScene sets a new state of the art, exceeding EmbodiedOcc by 3.7% while running 1.62× faster at inference. Experiments on Occ3D-nuScenes and in-the-wild scenes indicate that the approach also carries over to other environments.

📝 Abstract
Occupancy prediction provides critical geometric and semantic understanding for robotics but faces efficiency-accuracy trade-offs. Current dense methods waste computation on empty voxels, while sparse query-based approaches lack robustness in diverse and complex indoor scenes. In this letter, we propose DiScene, a novel sparse query-based framework that leverages multi-level distillation to achieve efficient and robust occupancy prediction. In particular, our method incorporates two key innovations: (1) a Multi-level Consistent Knowledge Distillation strategy, which transfers hierarchical representations from large teacher models to lightweight students through coordinated alignment across four levels, namely encoder-level feature alignment, query-level feature matching, prior-level spatial guidance, and anchor-level high-confidence knowledge transfer; and (2) a Teacher-Guided Initialization policy, which employs optimized parameter warm-up to accelerate model convergence. Validated on the Occ-ScanNet benchmark, DiScene achieves 23.2 FPS without depth priors while outperforming our baseline method, OPUS, by 36.1% and even surpassing the depth-enhanced version, OPUS$\dagger$. With depth integration, DiScene$\dagger$ attains new SOTA performance, surpassing EmbodiedOcc by 3.7% with 1.62× faster inference. Furthermore, experiments on the Occ3D-nuScenes benchmark and in-the-wild scenarios demonstrate the versatility of our approach in various environments.
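The four distillation levels named in the abstract can be pictured as a weighted sum of per-level losses. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss functions, so mean squared error for the encoder, query, and prior levels, cross-entropy on teacher pseudo-labels for the anchor level, the dictionary keys, the `weights` argument, and the `conf_threshold` value are all hypothetical choices.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized flat lists of floats."""
    assert len(a) == len(b) and a
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cross_entropy(probs, label):
    """Negative log-likelihood of an integer class label under a probability vector."""
    return -math.log(max(probs[label], 1e-12))


def multi_level_distill_loss(student, teacher, weights, conf_threshold=0.9):
    """Weighted sum of four hypothetical distillation terms.

    `student` and `teacher` are dicts with illustrative (assumed) keys:
      "encoder": flat list of encoder features
      "query":   flat list of query features
      "points":  flat list of predicted 3D coordinates (prior level)
      "probs":   list of per-query class probability vectors
    """
    losses = {
        "encoder": mse(student["encoder"], teacher["encoder"]),
        "query": mse(student["query"], teacher["query"]),
        "prior": mse(student["points"], teacher["points"]),
    }
    # Anchor level (assumed form): only teacher predictions at or above the
    # confidence threshold are used as pseudo-labels for the student.
    anchor_terms = []
    for s_probs, t_probs in zip(student["probs"], teacher["probs"]):
        t_conf = max(t_probs)
        if t_conf >= conf_threshold:
            anchor_terms.append(cross_entropy(s_probs, t_probs.index(t_conf)))
    losses["anchor"] = sum(anchor_terms) / len(anchor_terms) if anchor_terms else 0.0
    total = sum(weights[name] * value for name, value in losses.items())
    return total, losses
```

In a real training loop the student and teacher outputs would be tensors from a deep learning framework and the per-level weights would be tuned; plain lists are used here only to keep the sketch dependency-free.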
Problem

Research questions and friction points this paper is trying to address.

occupancy prediction
sparse query
indoor scenes
efficiency-accuracy trade-off
robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Query
Multi-level Knowledge Distillation
Occupancy Prediction
Teacher-Guided Initialization
Efficient 3D Scene Understanding
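One plausible reading of Teacher-Guided Initialization is a warm start in which the student copies teacher parameters wherever the two models share a parameter name and shape. The abstract only mentions "optimized parameter warm-up", so the matching rule, the function names, and the nested-list parameter format below are assumptions made for illustration rather than the paper's actual policy.

```python
import copy


def shape_of(value):
    """Return the shape of a rectangular nested list, e.g. [[0, 0], [0, 0]] -> (2, 2)."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


def init_from_teacher(student_params, teacher_params):
    """Copy teacher parameters into the student wherever name and shape agree.

    Both arguments map parameter names to nested lists of floats (an assumed
    format). Parameters without a match keep the student's own initial values.
    Returns the new parameter dict and the sorted list of copied names.
    """
    initialized = copy.deepcopy(student_params)
    copied = []
    for name, value in teacher_params.items():
        if name in initialized and shape_of(initialized[name]) == shape_of(value):
            initialized[name] = copy.deepcopy(value)
            copied.append(name)
    return initialized, sorted(copied)
```

Because a lightweight student is usually narrower or shallower than its teacher, many parameters will not match by shape; under this sketch those simply retain their default initialization.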