Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion

📅 2025-07-11
🤖 AI Summary
In 3D semantic scene completion (SSC), treating voxels as the basic interaction unit limits how well class-level information can be modeled. To address this, we propose DISC, a dual-stream Transformer architecture. Its core idea is to replace conventional voxel queries with discriminative class queries, which decouples instance-aware modeling from scene-context modeling. We also design a class-specific decoder that fuses geometric and semantic priors, giving a class-level feature representation that is optimized end to end. DISC achieves state-of-the-art performance on SemanticKITTI and SSCBench-KITTI-360, with mIoU scores of 17.35 and 20.55, respectively. On the SemanticKITTI hidden test set, its instance mIoU exceeds the best single-frame method by 17.9% and the best multi-frame method by 11.9%, even though DISC uses only single-frame input.

📝 Abstract
3D Semantic Scene Completion (SSC) has gained increasing attention due to its pivotal role in 3D perception. Recent advancements have primarily focused on refining voxel-level features to construct 3D scenes. However, treating voxels as the basic interaction units inherently limits the utilization of class-level information, which has proven critical for enhancing the granularity of completion results. To address this, we propose Disentangling Instance and Scene Contexts (DISC), a novel dual-stream paradigm that enhances learning for both instance and scene categories through separated optimization. Specifically, we replace voxel queries with discriminative class queries, which incorporate class-specific geometric and semantic priors. Additionally, we exploit the intrinsic properties of classes to design specialized decoding modules, facilitating targeted interactions and efficient class-level information flow. Experimental results demonstrate that DISC achieves state-of-the-art (SOTA) performance on both the SemanticKITTI and SSCBench-KITTI-360 benchmarks, with mIoU scores of 17.35 and 20.55, respectively. Remarkably, DISC outperforms multi-frame SOTA methods using only single-frame input and significantly improves instance category performance, surpassing the single-frame and multi-frame SOTA instance mIoU by 17.9% and 11.9%, respectively, on the SemanticKITTI hidden test set. The code is available at https://github.com/Enyu-Liu/DISC.
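For readers unfamiliar with the metric quoted above, mean intersection-over-union (mIoU) averages the per-class ratio of correctly labeled voxels to the union of predicted and ground-truth voxels of that class. The sketch below is a generic illustration, not the benchmarks' official evaluation code: the function name, the `ignore_label` value of 255, and the optional `classes` subset (which could, for example, restrict the average to instance categories) are assumptions. Benchmark protocols may also differ in which classes and voxels they include, and they report the result as a percentage.

```python
def mean_iou(pred, target, num_classes, ignore_label=255, classes=None):
    """Mean IoU over flat sequences of integer class labels.

    ignore_label and classes are illustrative assumptions, not taken
    from the paper or from any benchmark's evaluation script.
    """
    inter = [0] * num_classes  # voxels where prediction and target agree
    union = [0] * num_classes  # voxels predicted as, or labeled as, the class
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # skip voxels without a valid ground-truth label
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1
            union[t] += 1
    selected = range(num_classes) if classes is None else classes
    # Classes that never appear in pred or target are left out of the mean.
    ious = [inter[c] / union[c] for c in selected if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 1, 1, 2, 2, 0]
target = [0, 1, 2, 2, 255, 1]
score = mean_iou(pred, target, num_classes=3)
print(round(score, 4))  # 0.4444 (per-class IoUs: 1/2, 1/3, 1/2)
```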
Problem

Research questions and friction points this paper is trying to address.

Enhancing 3D scene completion by separating instance and scene contexts
Improving class-level information utilization in voxel-based 3D perception
Designing specialized decoding modules for targeted class interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stream paradigm for instance and scene contexts
Discriminative class queries with class-specific geometric and semantic priors replace voxel queries
Specialized decoding modules enhance class-level information flow
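As a rough illustration of what replacing voxel queries with class queries could look like, the following NumPy sketch lets a small set of per-class query vectors attend over flattened voxel features, with instance and scene categories handled in separate streams. It is not the DISC implementation: the function names, the single-head attention, the tensor shapes, and the class counts are all assumptions chosen for illustration, and the class-specific priors and specialized decoding modules are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def class_query_attention(class_queries, voxel_feats):
    """Single-head cross-attention from class queries to voxel features.

    class_queries: (C, D) array, one query vector per class (hypothetical).
    voxel_feats:   (N, D) array of flattened voxel features (hypothetical).
    Returns the updated queries (C, D) and the attention weights (C, N).
    """
    d = class_queries.shape[-1]
    scores = class_queries @ voxel_feats.T / np.sqrt(d)
    attn = softmax(scores, axis=-1)
    return attn @ voxel_feats, attn


def dual_stream_update(instance_q, scene_q, voxel_feats):
    # Two independent streams: instance-category queries and
    # scene-category queries never attend to each other here.
    inst_out, _ = class_query_attention(instance_q, voxel_feats)
    scene_out, _ = class_query_attention(scene_q, voxel_feats)
    return inst_out, scene_out


rng = np.random.default_rng(0)
voxel_feats = rng.normal(size=(512, 32))  # 512 voxels, 32 channels (made up)
instance_q = rng.normal(size=(8, 32))     # 8 instance classes (made up)
scene_q = rng.normal(size=(11, 32))       # 11 scene classes (made up)

inst_out, scene_out = dual_stream_update(instance_q, scene_q, voxel_feats)
print(inst_out.shape, scene_out.shape)  # (8, 32) (11, 32)
```

The point of the sketch is the change in query count: the number of queries scales with the number of classes rather than with the number of voxels, so each query can aggregate evidence for one category across the whole scene.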
Enyu Liu
Huazhong University of Science and Technology
En Yu
Huazhong University of Science and Technology
Sijia Chen
Huazhong University of Science and Technology
Wenbing Tao
Professor, School of Automation, Huazhong University of Science and Technology
image processing, computer vision, pattern recognition