🤖 AI Summary
This work addresses limitations in existing 3D masked autoencoding methods, which employ fixed masking ratios and fail to account for multi-scale representation dependencies and the geometric diversity of point clouds, while also relying on oversimplified point-to-point reconstruction assumptions that struggle with complex structures. To overcome these issues, the authors propose a self-distillation framework with variable masking ratios, featuring a dual-level self-representation alignment mechanism that aligns complementary geometric semantics across different masking ratios in the MAE and across different time steps in the MeanFlow Transformer. They additionally introduce a flow-conditioned fine-tuning module to enable diverse probabilistic reconstruction. The method achieves state-of-the-art performance, surpassing Point-MAE by 5.37% on ScanObjectNN, attaining mIoU scores of 96.07% (arteries) and 86.87% (aneurysms) in intracranial aneurysm segmentation, and reaching 47.3% AP@50 in 3D object detection, outperforming MaskPoint by 5.12%.
📝 Abstract
Masked autoencoders (MAE) have become a dominant paradigm in 3D representation learning, setting new performance benchmarks across various downstream tasks. However, existing methods with a fixed masking ratio neglect multi-level representational correlations and intrinsic geometric structures, while relying on point-wise reconstruction assumptions that conflict with the diversity of point clouds. To address these issues, we propose a 3D representation learning method, termed Point-SRA, which aligns representations through self-distillation and probabilistic modeling. Specifically, we assign different masking ratios to the MAE to capture complementary geometric and semantic information, while the MeanFlow Transformer (MFT) leverages cross-modal conditional embeddings to enable diverse probabilistic reconstruction. Our analysis further reveals that representations at different time steps in the MFT are also complementary. We therefore propose a Dual Self-Representation Alignment mechanism operating at both the MAE and MFT levels. Finally, we design a Flow-Conditioned Fine-Tuning Architecture to fully exploit the point cloud distribution learned via MeanFlow. Point-SRA outperforms Point-MAE by 5.37% on ScanObjectNN. On intracranial aneurysm segmentation, it reaches 96.07% mean IoU for arteries and 86.87% for aneurysms. For 3D object detection, Point-SRA achieves 47.3% AP@50, surpassing MaskPoint by 5.12%.
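To make the core idea concrete, here is a minimal NumPy sketch of aligning representations produced under two different masking ratios, as in the abstract's self-distillation step. This is an illustration under assumptions, not the paper's implementation: `mask_points`, `toy_encoder`, and `alignment_loss` are hypothetical names, the encoder is a stand-in linear map, and the actual Point-SRA loss and architectures (MAE, MFT) are far richer.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_points(points, ratio, rng):
    """Randomly drop `ratio` of the points, keeping the rest visible."""
    n = points.shape[0]
    keep = rng.permutation(n)[: int(n * (1 - ratio))]
    return points[keep]

def toy_encoder(points, W):
    """Stand-in for the MAE encoder: mean-pooled nonlinear projection."""
    return np.tanh(points @ W).mean(axis=0)

def alignment_loss(z_a, z_b):
    """1 - cosine similarity between the two branch representations."""
    z_a = z_a / np.linalg.norm(z_a)
    z_b = z_b / np.linalg.norm(z_b)
    return 1.0 - float(z_a @ z_b)

points = rng.normal(size=(1024, 3))   # a toy point cloud
W = rng.normal(size=(3, 64)) * 0.1    # shared (student/teacher) weights

# Two branches see the same cloud under different masking ratios;
# minimizing the loss pulls their pooled representations together.
z_low = toy_encoder(mask_points(points, 0.4, rng), W)
z_high = toy_encoder(mask_points(points, 0.8, rng), W)
loss = alignment_loss(z_low, z_high)
```

In practice the two branches would be the student and an EMA teacher, and the same alignment principle is applied a second time across MFT time steps; the sketch only shows the masking-ratio axis.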