Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

📅 2026-03-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a key limitation of current spatial intelligence research: its reliance on small-scale, manually annotated 3D datasets that suffer from domain bias. The authors propose the first fully automatic, end-to-end pipeline for constructing a large-scale, multi-granularity, multimodal 3D spatial perception dataset directly from raw video streams. By integrating 3D Gaussian Splatting for high-fidelity scene reconstruction, the method automatically generates 2D/3D masks, bounding boxes, instance descriptions, and spatial question-answer pairs, providing multi-level supervision signals across geometric, semantic, and relational dimensions. The resulting Holi-Spatial-4M dataset comprises 12K optimized 3D scenes and over 4 million annotations. The pipeline significantly outperforms existing approaches on benchmarks such as ScanNet, ScanNet++, and DL3DV, and fine-tuning on the dataset effectively enhances the spatial reasoning capabilities of vision-language models.
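The summary describes a multi-stage curation pipeline: reconstruct each scene with 3DGS, segment frames into 2D masks, lift masks to 3D boxes, caption instances, and generate spatial QA pairs. A minimal sketch of that control flow is below; note that every stage function here is a hypothetical placeholder standing in for the paper's actual components (its segmenter, captioner, and QA generator are not named here):

```python
from dataclasses import dataclass, field

@dataclass
class SceneAnnotations:
    """Multi-granularity labels attached to one scene (names are illustrative)."""
    masks_2d: list = field(default_factory=list)   # per-frame instance masks
    boxes_3d: list = field(default_factory=list)   # 3D boxes lifted from masks
    captions: list = field(default_factory=list)   # one caption per instance
    qa_pairs: list = field(default_factory=list)   # spatial QA supervision

def reconstruct_3dgs(frames):
    # Placeholder for per-scene 3D Gaussian Splatting optimization.
    return {"n_frames": len(frames)}

def segment_frames(frames):
    # Placeholder: one 2D instance mask per frame.
    return [f"mask_{i}" for i, _ in enumerate(frames)]

def lift_masks_to_3d(scene, masks):
    # Placeholder: aggregate multi-view masks into 3D bounding boxes.
    return [f"box_{m}" for m in masks]

def caption_instances(boxes):
    # Placeholder: instance-level descriptions.
    return [f"an object at {b}" for b in boxes]

def generate_spatial_qa(boxes, captions):
    # Placeholder: geometric/relational/semantic QA from instance geometry.
    return [(f"Where is {c}?", b) for b, c in zip(boxes, captions)]

def curate_scene(frames):
    """Run the five stages in order and collect the annotations."""
    scene = reconstruct_3dgs(frames)
    ann = SceneAnnotations()
    ann.masks_2d = segment_frames(frames)
    ann.boxes_3d = lift_masks_to_3d(scene, ann.masks_2d)
    ann.captions = caption_instances(ann.boxes_3d)
    ann.qa_pairs = generate_spatial_qa(ann.boxes_3d, ann.captions)
    return scene, ann
```

The point of the sketch is only the staging: each annotation level is derived automatically from the previous one, so no human labeling step appears anywhere in the loop.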

📝 Abstract
The pursuit of spatial intelligence fundamentally relies on access to large-scale, fine-grained 3D data. However, existing approaches predominantly construct spatial understanding benchmarks by generating question-answer (QA) pairs from a limited number of manually annotated datasets, rather than systematically annotating new large-scale 3D scenes from raw web data. As a result, their scalability is severely constrained, and model performance is further hindered by domain gaps inherent in these narrowly curated datasets. In this work, we propose Holi-Spatial, the first large-scale, spatially-aware multimodal dataset constructed from raw video inputs by a fully automated data curation pipeline, without human intervention. Holi-Spatial supports multi-level spatial supervision, ranging from geometrically accurate 3D Gaussian Splatting (3DGS) reconstructions with rendered depth maps to object-level and relational semantic annotations, together with corresponding spatial QA pairs. Following this principled and systematic pipeline, we further construct Holi-Spatial-4M, the first large-scale, high-quality 3D semantic dataset of its kind, containing 12K optimized 3DGS scenes, 1.3M 2D masks, 320K 3D bounding boxes, 320K instance captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs spanning diverse geometric, relational, and semantic reasoning tasks. Holi-Spatial delivers exceptional data curation quality, significantly outperforming existing feed-forward and per-scene optimized methods on datasets such as ScanNet, ScanNet++, and DL3DV. Furthermore, fine-tuning Vision-Language Models (VLMs) on spatial reasoning tasks using this dataset leads to substantial improvements in model performance.
Problem

Research questions and friction points this paper is trying to address.

spatial intelligence
3D data
large-scale dataset
domain gap
spatial reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Gaussian Splatting
automated dataset curation
spatial reasoning
multimodal 3D data
vision-language models