🤖 AI Summary
Point-based neural rendering (PBNR) on mobile GPUs suffers from three critical bottlenecks: load imbalance in level-of-detail (LoD) search, irregular memory access patterns, and severe thread divergence during rasterization. This paper proposes an algorithm–architecture co-design: (1) the SLTree—a subtree-structured data structure—coupled with the LTcore hardware unit enables low-overhead, highly parallel LoD search; and (2) a divergence-free rasterization algorithm integrated with the SPcore hardware accelerator eliminates thread-level control-flow divergence. The approach requires only lightweight GPU hardware extensions. Evaluation shows 3.9× speedup and 98% energy reduction over baseline mobile GPUs, and 1.8× higher performance with 54% lower energy versus state-of-the-art accelerators, at negligible area overhead. Our core contribution is the first systematic integration of data-structure innovation, domain-specific hardware design, and algorithmic restructuring—explicitly addressing the sparsity and irregularity inherent to mobile PBNR workloads.
📝 Abstract
Rendering is critical in fields like 3D modeling, AR/VR, and autonomous driving, where high-quality, real-time output is essential. Point-based neural rendering (PBNR) offers a photorealistic and efficient alternative to conventional methods, yet it is still challenging to achieve real-time rendering on mobile platforms. We pinpoint two major bottlenecks in PBNR pipelines: LoD search and splatting. LoD search suffers from workload imbalance and irregular memory access, making it inefficient on off-the-shelf GPUs. Meanwhile, splatting introduces severe warp divergence across GPU threads due to its inherent sparsity.
To tackle these challenges, we propose SLTarch, an algorithm-architecture co-designed framework. At its core, SLTarch introduces SLTree, a dedicated subtree-based data structure, and LTcore, a specialized hardware architecture tailored for efficient LoD search. Additionally, we co-design a divergence-free splatting algorithm with SPcore, a simple yet principled hardware augmentation to existing PBNR accelerators. Compared to a mobile GPU, SLTarch achieves 3.9× speedup and 98% energy savings with negligible architecture overhead. Compared to existing accelerator designs, SLTarch achieves 1.8× speedup with 54% energy savings.
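The warp-divergence bottleneck in splatting can be illustrated with a toy utilization model (this sketch is not from the paper; the function name, lane counts, and footprint values are hypothetical). In SIMT execution, a warp finishes only when its slowest lane does, so when each thread splats a point whose screen footprint varies widely, most lanes sit idle:

```python
# Illustrative model (not from the paper): why sparse splatting
# under-utilizes a SIMT warp, and what a divergence-free
# redistribution of per-pixel work buys back.

def warp_utilization(pixels_per_thread):
    """Fraction of useful lane-cycles in one warp.

    Each lane rasterizes one point; a lane covering P pixels
    is busy for P cycles, but the warp runs for max(P) cycles.
    """
    worst = max(pixels_per_thread)
    useful = sum(pixels_per_thread)
    return useful / (worst * len(pixels_per_thread))

# Sparse scene: point footprints vary a lot across an 8-lane warp,
# so one large splat stalls the other seven lanes.
divergent = warp_utilization([1, 2, 1, 30, 1, 3, 1, 1])

# Divergence-free regrouping: the same 40 pixels of work,
# redistributed evenly across lanes.
uniform = warp_utilization([5, 5, 5, 5, 5, 5, 5, 5])

print(divergent)  # ~0.17: lanes idle ~83% of the time
print(uniform)    # 1.0: every lane-cycle does useful work
```

The toy numbers show the shape of the problem SPcore targets: the total work is identical in both cases, yet the divergent schedule wastes most of the warp's throughput.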