BalanceGS: Algorithm-System Co-design for Efficient 3D Gaussian Splatting Training on GPU

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses three efficiency bottlenecks in 3D Gaussian Splatting (3DGS) training: non-uniform Gaussian densification, load imbalance during projection, and memory-access fragmentation in color splatting. It proposes an algorithm-system co-optimization framework with three components: a workload-aware density control mechanism, a feature-similarity-driven adaptive sampling and merging strategy, and GPU memory-hierarchy-aware batch loading with reordering. By combining heuristic density regulation, dynamic thread-level task assignment, and shared-memory optimization, the framework improves projection and rendering throughput. Evaluated on an NVIDIA A100 GPU, the approach achieves a 1.44× end-to-end training speedup with a PSNR degradation of only 0.03 dB.

📝 Abstract
3D Gaussian Splatting (3DGS) has emerged as a promising 3D reconstruction technique. The traditional 3DGS training pipeline follows three sequential steps: Gaussian densification, Gaussian projection, and color splatting. Despite its promising reconstruction quality, this conventional approach suffers from three critical inefficiencies: (1) skewed density allocation during Gaussian densification, (2) imbalanced computation workload during Gaussian projection, and (3) fragmented memory access during color splatting. To tackle these challenges, we introduce BalanceGS, an algorithm-system co-design for efficient 3DGS training. (1) At the algorithm level, we propose heuristic workload-sensitive Gaussian density control to automatically balance point distributions, removing 80% redundant Gaussians in dense regions while filling gaps in sparse areas. (2) At the system level, we propose similarity-based Gaussian sampling and merging, which replaces the static one-to-one thread-pixel mapping with adaptive workload distribution: threads dynamically process variable numbers of Gaussians based on local cluster density. (3) At the mapping level, we propose a reordering-based memory access mapping strategy that restructures RGB storage and enables batch loading in shared memory. Extensive experiments demonstrate that, compared with 3DGS, our approach achieves a 1.44$\times$ training speedup on an NVIDIA A100 GPU with negligible quality degradation.
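
The contrast in point (2) between a static one-to-one thread-pixel mapping and an adaptive workload distribution can be illustrated with a small CPU-side sketch. This is not the BalanceGS kernel and does not implement similarity-based sampling or merging; it swaps in a plain even partition of a flattened work list as a stand-in. The function names and the per-pixel Gaussian counts are hypothetical.

```python
def static_loads(counts):
    """Static mapping: thread i handles pixel i, so its load is counts[i]."""
    return list(counts)


def balanced_ranges(counts, num_threads):
    """Split the flattened list of Gaussians into near-equal contiguous ranges.

    Returns one (start, end) half-open range per thread over the flattened
    index space [0, sum(counts)). This even split is an assumed stand-in for
    an adaptive scheme, not the method from the paper.
    """
    total = sum(counts)
    chunk = -(-total // num_threads) if total else 0  # ceiling division
    ranges = []
    for t in range(num_threads):
        start = min(t * chunk, total)
        end = min(start + chunk, total)
        ranges.append((start, end))
    return ranges


# Hypothetical number of Gaussians overlapping each of 8 pixels.
gaussians_per_pixel = [90, 5, 5, 80, 5, 5, 10, 0]

ranges = balanced_ranges(gaussians_per_pixel, num_threads=8)
static_max = max(static_loads(gaussians_per_pixel))
balanced_max = max(end - start for start, end in ranges)
print(static_max, balanced_max)
```

Under the static mapping the slowest thread determines the finishing time, so one crowded pixel stalls the rest. The even split bounds every thread's load by the ceiling of the mean.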
Problem

Research questions and friction points this paper is trying to address.

Addresses skewed density allocation in Gaussian densification
Resolves imbalanced computation workload during Gaussian projection
Optimizes fragmented memory access in color splatting
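
One simple way to make the first item, skewed density allocation, concrete is to bin Gaussian centers into a uniform grid and compare the fullest cell with the mean occupancy. The sketch below is only a diagnostic under assumed inputs; the grid resolution, the skew metric, and the sample points are hypothetical and do not come from the paper.

```python
from collections import Counter


def cell_counts(points, cell_size):
    """Count points per cubic grid cell with side length cell_size."""
    return Counter(
        tuple(int(coord // cell_size) for coord in point) for point in points
    )


def skew(counts):
    """Ratio of the fullest cell to the mean occupancy of non-empty cells."""
    values = list(counts.values())
    return max(values) / (sum(values) / len(values))


# Hypothetical Gaussian centers: four crowd one cell, two sit alone.
points = [
    (0.1, 0.1, 0.1), (0.2, 0.1, 0.3), (0.4, 0.2, 0.2), (0.3, 0.3, 0.1),
    (1.5, 0.2, 0.1), (0.2, 1.7, 0.3),
]

counts = cell_counts(points, cell_size=1.0)
density_skew = skew(counts)
print(counts, density_skew)
```

A skew of 1.0 means every occupied cell holds the same number of points; larger values indicate that a few cells dominate.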
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heuristic density control balances Gaussian point distributions
Similarity-based sampling enables adaptive workload distribution
Reordering memory access restructures RGB storage efficiently
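
The idea behind the last item, restructuring RGB storage so that data can be loaded in batches, can be sketched on the CPU as a stable counting sort that groups colors by tile. The abstract does not specify the actual layout, so this is an assumption throughout: each Gaussian is assigned to exactly one tile (a simplification), and the names `reorder_by_tile`, `colors`, and `tile_ids` are hypothetical.

```python
def reorder_by_tile(colors, tile_ids):
    """Group RGB entries by tile so each tile's colors are contiguous.

    Returns (reordered, offsets) where tile t owns the slice
    reordered[offsets[t]:offsets[t + 1]]. Order within a tile is preserved.
    """
    num_tiles = max(tile_ids) + 1 if tile_ids else 0

    # Histogram of entries per tile, shifted by one for the prefix sum.
    offsets = [0] * (num_tiles + 1)
    for t in tile_ids:
        offsets[t + 1] += 1
    # An exclusive prefix sum turns counts into start offsets.
    for t in range(num_tiles):
        offsets[t + 1] += offsets[t]

    cursor = offsets[:-1]  # slicing copies, so offsets stays intact
    reordered = [None] * len(colors)
    for color, t in zip(colors, tile_ids):
        reordered[cursor[t]] = color
        cursor[t] += 1
    return reordered, offsets


# Hypothetical RGB values stored in Gaussian-ID order, interleaved across tiles.
colors = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0)]
tile_ids = [1, 0, 1, 0]

reordered, offsets = reorder_by_tile(colors, tile_ids)
tile0_batch = reordered[offsets[0]:offsets[1]]
print(reordered, offsets, tile0_batch)
```

After reordering, a tile reads one contiguous slice instead of gathering scattered entries. On a GPU, this is the kind of access pattern that allows a thread block to stage a batch in shared memory with coalesced loads.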
Junyi Wu
Shanghai Jiao Tong University
Jiaming Xu
Shanghai Jiao Tong University, SII
Jinhao Li
Shanghai Jiao Tong University, SII
Yongkang Zhou
Shanghai Jiao Tong University, SII
Jiayi Pan
Shanghai Jiao Tong University, Infinigence-AI
Xingyang Li
Shanghai Jiao Tong University
Machine Learning Systems
Guohao Dai
Associate Professor, Shanghai Jiao Tong University
Sparse Computation, Large-scale Graph Processing, FPGA, Circuits and Systems