Complementary Information Guided Occupancy Prediction via Multi-Level Representation Fusion

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing camera-based 3D occupancy prediction methods primarily rely on lightweight backbones or cascaded architectures, yielding limited performance gains while neglecting the complementary modeling of multi-level image features (semantic, geometric, and depth). To address this, we propose a two-stage multi-level feature fusion framework. In the first stage, deformable convolutions enable adaptive cross-scale and cross-modal fusion of segmentation, depth, and geometric features. In the second stage, SAM-guided knowledge distillation enhances the discriminability and geometric consistency of 3D occupancy grid predictions. The method strengthens feature representation without increasing training overhead. Evaluated on SemanticKITTI, it achieves state-of-the-art performance, indicating a favorable balance between accuracy and computational cost.

📝 Abstract
Camera-based occupancy prediction is a mainstream approach for 3D perception in autonomous driving, aiming to infer complete 3D scene geometry and semantics from 2D images. Most existing methods focus on improving performance through structural modifications, such as lightweight backbones and complex cascaded frameworks, and achieve good yet limited gains. Few studies approach the problem from the perspective of representation fusion, leaving the rich diversity of features in 2D images underutilized. Motivated by this, we propose CIGOcc, a two-stage occupancy prediction framework based on multi-level representation fusion. CIGOcc extracts segmentation, graphics, and depth features from an input image and introduces a deformable multi-level fusion mechanism to fuse these three multi-level features. Additionally, CIGOcc incorporates knowledge distilled from SAM to further enhance prediction accuracy. Without increasing training costs, CIGOcc achieves state-of-the-art performance on the SemanticKITTI benchmark. The code is provided in the supplementary material and will be released at https://github.com/VitaLemonTea1/CIGOcc
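
To make the idea of deformable multi-level fusion concrete, the following is a minimal sketch and not CIGOcc's implementation. It assumes three same-sized 2D feature maps (segmentation, graphics, depth), reads each one at its own offset location with bilinear interpolation, and blends the samples with softmax weights. The function names (`bilinear_sample`, `deformable_fuse`), the fixed offsets, and the fixed logits are all hypothetical; in a trained network the offsets and weights would typically be predicted by learned layers.

```python
import math

def bilinear_sample(fmap, y, x):
    """Sample a 2D map (list of rows) at fractional (y, x); out-of-range reads count as 0."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    # Visit the four integer neighbours and weight each by its closeness to (y, x).
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * fmap[yy][xx]
    return value

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def deformable_fuse(feature_maps, offsets, logits, y, x):
    """Fuse several maps at pixel (y, x).

    Each map is read at (y + dy, x + dx) using its own (dy, dx) offset,
    then the samples are blended with softmax(logits) as weights.
    Offsets and logits are fixed inputs here (an assumption for illustration).
    """
    weights = softmax(logits)
    samples = [
        bilinear_sample(fmap, y + dy, x + dx)
        for fmap, (dy, dx) in zip(feature_maps, offsets)
    ]
    return sum(w * s for w, s in zip(weights, samples))

# Toy 2x2 maps standing in for segmentation, depth, and graphics features.
seg = [[0.0, 1.0], [2.0, 3.0]]
depth = [[10.0, 10.0], [10.0, 10.0]]
graphics = [[0.0, 0.0], [0.0, 4.0]]

fused = deformable_fuse(
    [seg, depth, graphics],
    offsets=[(0.0, 0.0), (0.0, 0.0), (0.5, 0.5)],
    logits=[0.0, 0.0, 0.0],  # equal weights after softmax
    y=0, x=0,
)
```

With equal weights, the fused value at pixel (0, 0) is the average of the three samples. The graphics map contributes the interpolated value at (0.5, 0.5) rather than its value at (0, 0), which is what lets each feature level be read where it is most informative.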

Problem

Research questions and friction points this paper is trying to address.

Predicting 3D scene geometry from 2D images for autonomous driving
Fusing multi-level image features like segmentation and depth
Enhancing occupancy prediction accuracy without increasing training costs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-level representation fusion for occupancy prediction
Deformable fusion of segmentation, graphics, depth features
Knowledge distillation from SAM enhances prediction accuracy
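
The distillation item above can be illustrated with a generic feature-distillation loss. This is a sketch under stated assumptions, not the loss used in CIGOcc: the summary only says that knowledge is distilled from SAM, so the cosine form, the `distill_weight` value, and the function names are hypothetical. The teacher features stand in for outputs of a frozen SAM encoder, and the student features for the occupancy network's own features at matching locations.

```python
import math

def cosine_distill_loss(student, teacher):
    """Mean (1 - cosine similarity) over paired feature vectors.

    `student` and `teacher` are equal-length lists of equal-length vectors.
    The loss is near 0 when directions match and 1 when they are orthogonal.
    """
    total = 0.0
    for s, t in zip(student, teacher):
        dot = sum(a * b for a, b in zip(s, t))
        norm_s = math.sqrt(sum(a * a for a in s))
        norm_t = math.sqrt(sum(b * b for b in t))
        # The small epsilon guards against division by zero for all-zero vectors.
        total += 1.0 - dot / (norm_s * norm_t + 1e-12)
    return total / len(student)

def training_loss(task_loss, student, teacher, distill_weight=0.5):
    """Add the weighted distillation term to the task (occupancy) loss.

    distill_weight is an assumed hyperparameter, chosen only for illustration.
    """
    return task_loss + distill_weight * cosine_distill_loss(student, teacher)

# Toy example: one student vector orthogonal to its teacher vector.
loss = training_loss(
    task_loss=2.0,
    student=[[1.0, 0.0]],
    teacher=[[0.0, 1.0]],
)
```

In a distillation setup of this kind, the teacher is queried only while training, so the deployed student network keeps its original inference path.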

Authors

Rongtao Xu
MBZUAI << CASIA << HUST
Intelligent Robot, Embodied AI, VLA, VLM, Spatialtemporal AI

Jinzhou Lin
Beijing University of Posts and Telecommunications, China

Jialei Zhou
Tongji University, China

Jiahua Dong
Shenyang Institute of Automation, Chinese Academy of Sciences, China

Changwei Wang
Shandong Computer Science Center
Multimodal Learning, Embodied AI, Edge Intelligent Computing, AI for Healthcare, Safety Alignment

Ruisheng Wang
University of Calgary, Canada

Li Guo
Beijing University of Posts and Telecommunications, China

Shibiao Xu
Beijing University of Posts and Telecommunications
Computer Vision, Machine Learning, Computer Graphics

Xiaodan Liang
Professor of Computer Science, Sun Yat-sen University, MBZUAI, CMU, NUS
Computer vision, Embodied AI, Machine learning