LitePT: Lighter Yet Stronger Point Transformer

📅 2025-12-15
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the unclear division of labour between convolution and attention in 3D point cloud modeling, this paper first uncovers their complementary roles and then proposes a stage-wise hybrid architecture: lightweight depthwise separable convolutions extract local geometric features in the early, high-resolution stages, while lightweight attention modules model long-range semantic context in the deeper, low-resolution stages. It further introduces PointROPE, a training-free, structure-aware 3D positional encoding that explicitly preserves spatial relationships. Experiments demonstrate that the method reduces parameter count by 3.6×, doubles inference speed, and halves GPU memory consumption, while matching or surpassing Point Transformer V3 on mainstream benchmarks.

πŸ“ Abstract
Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high-resolution in early layers, where attention is expensive without bringing any benefits; attention captures high-level semantics and context in low-resolution, deep layers more efficiently. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid the loss of spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has $3.6 imes$ fewer parameters, runs $2 imes$ faster, and uses $2 imes$ less memory than the state-of-the-art Point Transformer V3, but nonetheless matches or even outperforms it on a range of tasks and datasets. Code and models are available at: https://github.com/prs-eth/LitePT.
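The stage-wise design above can be sketched in a few lines. The following toy NumPy code is an illustrative assumption, not the paper's implementation: `knn_local_conv` stands in for the lightweight depthwise separable convolutions of the early high-resolution stages, `global_attention` for the attention blocks of the deep low-resolution stages, and downsampling is a naive stride-2 subsample rather than the pooling a real backbone would use.

```python
import numpy as np

def knn_local_conv(points, feats, k=8):
    """Crude local aggregation: average the features of the k nearest
    neighbours. Stands in for early-stage convolution over local geometry."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :k]          # k nearest points (incl. self)
    return feats[idx].mean(axis=1)               # (n, c)

def global_attention(feats):
    """Plain softmax self-attention over all remaining points, standing in
    for the lightweight attention blocks of the deep stages."""
    logits = feats @ feats.T / np.sqrt(feats.shape[1])
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ feats

def hybrid_backbone(points, feats, conv_stages=2, attn_stages=2):
    """Stage-wise hybrid: convolution-like ops while resolution is high,
    attention once the point set is small; halve the point count per
    convolution stage (naive stride-2 downsampling for illustration)."""
    for _ in range(conv_stages):
        feats = knn_local_conv(points, feats)
        points, feats = points[::2], feats[::2]
    for _ in range(attn_stages):
        feats = global_attention(feats)
    return points, feats

rng = np.random.default_rng(0)
pts = rng.normal(size=(64, 3))        # 64 points in 3D
f = rng.normal(size=(64, 16))         # 16-channel features
out_pts, out_f = hybrid_backbone(pts, f)
print(out_pts.shape, out_f.shape)     # (16, 3) (16, 16)
```

The point of the sketch is the cost asymmetry: the quadratic attention only ever runs on the downsampled point set, while the cheap local aggregation handles the full-resolution input.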
Problem

Research questions and friction points this paper is trying to address.

How convolution and attention should be assembled in a 3D point cloud backbone, given their unclear division of labour
Attention in early, high-resolution layers is expensive yet brings little benefit for low-level geometry extraction
Discarding redundant convolution layers loses spatial layout information, which must be recovered without added training cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stage-wise hybrid backbone: convolutions in early high-resolution stages, attention in deep low-resolution stages
PointROPE: a training-free, structure-aware 3D positional encoding that preserves spatial layout
3.6× fewer parameters, 2× faster inference, and half the GPU memory of Point Transformer V3, with matching or better accuracy
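Since the paper does not spell out PointROPE's formulation here, the following is only a plausible sketch of a rotary-style, training-free 3D positional encoding: channels are split into pairs, each pair is assigned to one of the x/y/z axes, and the pair is rotated by an angle proportional to that coordinate. All names and the per-axis channel split are assumptions for illustration.

```python
import numpy as np

def rope_3d(feats, coords, base=100.0):
    """Rotary-style 3D positional encoding (sketch, not the paper's exact
    PointROPE). Training-free: no learned parameters. Each channel pair is
    rotated by an angle proportional to one coordinate of the point, at a
    frequency that decays geometrically with the pair index."""
    n, c = feats.shape
    assert c % 6 == 0, "need channel pairs divisible across the 3 axes"
    pairs_per_axis = c // 6
    out = feats.copy()
    for axis in range(3):                         # x, y, z
        for j in range(pairs_per_axis):
            i0 = axis * 2 * pairs_per_axis + 2 * j
            theta = coords[:, axis] / (base ** (j / pairs_per_axis))
            cos, sin = np.cos(theta), np.sin(theta)
            a, b = feats[:, i0], feats[:, i0 + 1]
            out[:, i0] = a * cos - b * sin        # 2D rotation of the pair
            out[:, i0 + 1] = a * sin + b * cos
    return out

rng = np.random.default_rng(1)
coords = rng.normal(size=(10, 3))                 # 10 points in 3D
f = rng.normal(size=(10, 12))                     # 12-channel features
g = rope_3d(f, coords)
```

Because each pair transform is a pure rotation, feature norms are preserved, and dot products between two encoded points depend only on their coordinate differences, which is what makes rotary encodings attractive as a drop-in, structure-aware replacement for the spatial cues otherwise supplied by convolutions.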