🤖 AI Summary
Existing self-supervised methods for point clouds rely largely on masked reconstruction or explicit geometric generation, which ties them to input recovery rather than to modeling predictive dependencies. This work proposes PointNTP, a decoder-free, fully causal latent next-token prediction paradigm: each point cloud is partitioned into local patches, which are serialized into a structured 3D token sequence according to patch-center geometry. A causal Transformer trained under prefix-only conditioning then learns structural dependencies directly in latent space. The method combines geometric serialization, prefix-conditioned modeling, a shift-based prediction objective, and stop-gradient targets, and reports competitive results across multiple benchmarks: ScanObjectNN (93.8%, 92.6%, and 89.3% accuracy on the OBJ_BG, OBJ_ONLY, and PB_T50_RS variants), ShapeNetPart (85.0% class-wise mean Intersection-over-Union), and S3DIS Area 5 (71.1% mAcc).
📝 Abstract
With the rapid progress of multimodal foundation models and predictive pre-training, an important open question is how to equip 3D point clouds with a pre-training paradigm that is better aligned with next-token and next-embedding learning. Existing point-cloud self-supervised methods are largely built on masked reconstruction or explicit geometric generation, and thus remain tied to input recovery rather than predictive dependency modeling. In this paper, we introduce PointNTP, which reformulates point cloud pre-training as a fully causal, decoder-free latent Next-Token Prediction problem. Specifically, each point cloud is first partitioned into local patches and serialized into a structured 3D token sequence according to patch-center geometry. The resulting sequence is then modeled by a causal Transformer under prefix-only conditioning, and trained with a shift-based prediction objective stabilized by stop-gradient targets. This design enables the model to learn structural dependencies directly in latent space, without reconstruction decoders or explicit geometric recovery. Extensive experiments demonstrate that the proposed PointNTP is highly competitive across multiple downstream tasks: it achieves 93.8% (+0.5%), 92.6% (+0.3%), and 89.3% (+1.1%) on OBJ_BG, OBJ_ONLY, and PB_T50_RS of ScanObjectNN, respectively; obtains 85.0% (+0.1%) in Cls. mIoU on ShapeNetPart; and reaches 71.1% mAcc on S3DIS Area 5. Overall, decoder-free causal latent prediction provides a simple, scalable, and potentially modality-agnostic paradigm for point-cloud self-supervised learning, offering a new perspective on foundation-style predictive learning for 3D data.
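The pipeline described above (serialize patches by center geometry, condition only on the prefix, predict the next token's latent) can be illustrated with a minimal pure-Python sketch. Everything here is an assumption for illustration, not the paper's implementation: the abstract does not say which serialization curve or loss is used, so the sketch swaps in Z-order (Morton) sorting of patch centers and a cosine distance, and the helper names `morton_key`, `serialize`, `causal_mask`, `shift_pairs`, and `cosine_loss` are hypothetical. Stop-gradient needs an autodiff framework, so it appears only as a comment.

```python
import math

def morton_key(center, bits=10):
    """Interleave the bits of quantized (x, y, z) in [0, 1) into one Z-order integer.
    Assumption: Z-order is a stand-in; the source does not name the curve it uses."""
    cells = 1 << bits
    q = [min(cells - 1, max(0, int(c * cells))) for c in center]
    key = 0
    for b in range(bits):
        for axis in range(3):
            key |= ((q[axis] >> b) & 1) << (3 * b + axis)
    return key

def serialize(centers):
    """Return patch indices ordered along the Z-order curve of their centers."""
    return sorted(range(len(centers)), key=lambda i: morton_key(centers[i]))

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (prefix only)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def shift_pairs(order):
    """Pair each serialized token with the next one, which it should predict."""
    return list(zip(order[:-1], order[1:]))

def cosine_loss(pred, target):
    """1 - cosine similarity between a predicted and a target latent.
    Assumption: in training, `target` would be wrapped in a stop-gradient
    (e.g. detached from the graph) so only the prediction branch is updated."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / norm

# Four hypothetical patch centers in the unit cube.
centers = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.6, 0.1, 0.1), (0.1, 0.6, 0.1)]
order = serialize(centers)
pairs = shift_pairs(order)
mask = causal_mask(len(order))
```

In this sketch, `pairs` lists (context token, target token) indices for the shift-based objective, and `mask` is the lower-triangular pattern a causal Transformer would apply so that each position sees only its prefix. No decoder appears anywhere: the loss compares latents with latents.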