Octree Latent Diffusion for Semantic 3D Scene Generation and Completion

📅 2025-09-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing approaches to 3D semantic scene completion, extension, and generation lack a unified modeling framework, and cross-domain generalization (e.g., indoor vs. outdoor) typically requires retraining. To address this, the authors propose Octree Latent Semantic Diffusion, which uses a dual octree graph latent representation to decouple geometric structure from semantic modeling. In the first stage, structure diffusion operates on an octree-based geometric latent space to synthesize topology; in the second, conditional diffusion in a semantic latent space injects fine-grained class information. Combined with a graph-VAE decoder and inference-time latent-space inpainting/outpainting, the framework enables zero-shot cross-domain generation and completion. Experiments show that, given only a single-frame LiDAR input, it achieves high-fidelity, semantically consistent 3D reconstruction. Crucially, it exhibits strong out-of-distribution robustness and generalizes effectively across domains without fine-tuning.

📝 Abstract
The completion, extension, and generation of 3D semantic scenes are an interrelated set of capabilities that are useful for robotic navigation and exploration. Existing approaches seek to decouple these problems and solve them one-off. Additionally, these approaches are often domain-specific, requiring separate models for different data distributions, e.g. indoor vs. outdoor scenes. To unify these techniques and provide cross-domain compatibility, we develop a single framework that can perform scene completion, extension, and generation in both indoor and outdoor scenes, which we term Octree Latent Semantic Diffusion. Our approach operates directly on an efficient dual octree graph latent representation: a hierarchical, sparse, and memory-efficient occupancy structure. This technique disentangles synthesis into two stages: (i) structure diffusion, which predicts binary split signals to construct a coarse occupancy octree, and (ii) latent semantic diffusion, which generates semantic embeddings decoded by a graph VAE into voxel-level semantic labels. To perform semantic scene completion or extension, our model leverages inference-time latent inpainting, or outpainting respectively. These inference-time methods use partial LiDAR scans or maps to condition generation, without the need for retraining or fine-tuning. We demonstrate high-quality structure, coherent semantics, and robust completion from single LiDAR scans, as well as zero-shot generalization to out-of-distribution LiDAR data. These results indicate that completion-through-generation in a dual octree graph latent space is a practical and scalable alternative to regression-based pipelines for real-world robotic perception tasks.
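The two-stage pipeline in the abstract can be sketched as a toy NumPy loop. Everything here is an illustrative assumption, not the paper's implementation: the denoisers are stand-ins for the trained diffusion networks, and the graph-VAE decoder is reduced to a linear map from leaf embeddings to class labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoise(z, t):
    """Stand-in for a trained diffusion denoiser (assumption):
    nudges the latent toward its mode as t -> 0."""
    return z - 0.5 * t * np.tanh(z)

def structure_diffusion(depth=3, latent_dim=8, steps=10):
    """Stage (i): denoise a latent per octree node, then threshold it
    into a binary split signal that defines the coarse occupancy tree."""
    n_nodes = sum(8 ** d for d in range(depth))  # nodes over all levels
    z = rng.standard_normal((n_nodes, latent_dim))
    for t in np.linspace(1.0, 0.0, steps):
        z = toy_denoise(z, t)
    return z.mean(axis=1) > 0  # binary split signal per node

def semantic_diffusion(split, latent_dim=8, steps=10):
    """Stage (ii): conditional diffusion over semantic embeddings,
    one per occupied node produced by stage (i)."""
    z = rng.standard_normal((int(split.sum()), latent_dim))
    for t in np.linspace(1.0, 0.0, steps):
        z = toy_denoise(z, t)
    return z

def graph_vae_decode(z, n_classes=5):
    """Stand-in for the graph-VAE decoder (assumption): maps each
    embedding to a voxel-level semantic label."""
    logits = z @ rng.standard_normal((z.shape[1], n_classes))
    return logits.argmax(axis=1)

split = structure_diffusion()
labels = graph_vae_decode(semantic_diffusion(split))
print(split.shape, labels.shape)
```

The key property the sketch preserves is the decoupling: stage (ii) only ever sees the occupied nodes emitted by stage (i), so geometry and semantics are modeled in separate latent spaces.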
Problem

Research questions and friction points this paper is trying to address.

Unifying scene completion, extension, and generation across indoor and outdoor domains
Developing a single framework that avoids domain-specific models for different data distributions
Creating robust 3D semantic scene generation from partial LiDAR scans without retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual octree graph latent representation for efficiency
Two-stage diffusion: structure prediction and semantic generation
Inference-time inpainting/outpainting without retraining for generalization
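The inference-time completion idea above can be sketched with a RePaint-style masking loop (an assumption about the mechanism, not the paper's exact procedure): at each denoising step, the latents corresponding to the observed partial scan are re-imposed at the current noise level, and the model fills in the rest. The same recipe with a mask extending beyond the map boundary gives outpainting.

```python
import numpy as np

rng = np.random.default_rng(1)

def toy_denoise(z, t):
    """Stand-in for the trained latent-diffusion denoiser (assumption)."""
    return z - 0.5 * t * np.tanh(z)

def latent_inpaint(z_obs, mask, steps=20):
    """Condition generation on observed latents without retraining:
    after every denoising step, overwrite the known region with the
    observed latents noised to the current level t."""
    z = rng.standard_normal(z_obs.shape)
    for t in np.linspace(1.0, 0.0, steps):
        z = toy_denoise(z, t)
        z_known = z_obs + t * rng.standard_normal(z_obs.shape)
        z = np.where(mask, z_known, z)  # keep observed part, generate the rest
    return z

# Latents from a partial LiDAR scan occupy the first half of the grid.
z_obs = rng.standard_normal((16, 4))
mask = np.zeros_like(z_obs, dtype=bool)
mask[:8] = True
z_full = latent_inpaint(z_obs, mask)
print(z_full.shape)
```

Because the final step uses t = 0, the observed region comes back exactly, while the masked-out region is newly generated; no weights are touched, which is what allows the zero-shot cross-domain use described above.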