🤖 AI Summary
This work addresses limitations in existing 3D masked autoencoding methods, which employ fixed masking ratios and fail to account for multi-scale representation dependencies and the geometric diversity of point clouds, while also relying on oversimplified point-to-point reconstruction assumptions that struggle with complex structures. To overcome these issues, the authors propose a self-distillation framework with variable masking ratios, featuring a dual-level self-representation alignment mechanism that aligns complementary geometric semantics across different masking ratios in the MAE and across different time steps in the MeanFlow Transformer. They additionally introduce a flow-conditioned fine-tuning module to enable diverse probabilistic reconstruction. The method achieves state-of-the-art performance, surpassing Point-MAE by 5.37% on ScanObjectNN, attaining mIoU scores of 96.07% (arteries) and 86.87% (aneurysms) in intracranial aneurysm segmentation, and reaching 47.3% AP@50 in 3D object detection, outperforming MaskPoint by 5.12%.
📝 Abstract
Masked autoencoders (MAE) have become a dominant paradigm in 3D representation learning, setting new performance benchmarks across various downstream tasks. However, existing methods with a fixed masking ratio neglect multi-level representational correlations and intrinsic geometric structures, while relying on point-wise reconstruction assumptions that conflict with the diversity of point clouds. To address these issues, we propose a 3D representation learning method, termed Point-SRA, which aligns representations through self-distillation and probabilistic modeling. Specifically, we assign different masking ratios to the MAE to capture complementary geometric and semantic information, while the MeanFlow Transformer (MFT) leverages cross-modal conditional embeddings to enable diverse probabilistic reconstruction. Our analysis further reveals that representations at different time steps in the MFT are also complementary. We therefore propose a Dual Self-Representation Alignment mechanism operating at both the MAE and MFT levels. Finally, we design a Flow-Conditioned Fine-Tuning Architecture to fully exploit the point cloud distribution learned via MeanFlow. Point-SRA outperforms Point-MAE by 5.37% on ScanObjectNN. On intracranial aneurysm segmentation, it reaches 96.07% mean IoU for arteries and 86.87% for aneurysms. For 3D object detection, Point-SRA achieves 47.3% AP@50, surpassing MaskPoint by 5.12%.
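To make the core idea concrete, here is a minimal NumPy sketch of aligning representations produced under two different masking ratios, as in the abstract's self-distillation step. This is an illustration under assumptions, not the paper's implementation: `mask_points`, `toy_encoder`, and `alignment_loss` are hypothetical names, the encoder is a stand-in linear map, and the actual Point-SRA loss and architectures (MAE, MFT) are far richer.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_points(points, ratio, rng):
    """Randomly drop `ratio` of the points, keeping the rest visible."""
    n = points.shape[0]
    keep = rng.permutation(n)[: int(n * (1 - ratio))]
    return points[keep]

def toy_encoder(points, W):
    """Stand-in for the MAE encoder: mean-pooled nonlinear projection."""
    return np.tanh(points @ W).mean(axis=0)

def alignment_loss(z_a, z_b):
    """1 - cosine similarity between the two branch representations."""
    z_a = z_a / np.linalg.norm(z_a)
    z_b = z_b / np.linalg.norm(z_b)
    return 1.0 - float(z_a @ z_b)

points = rng.normal(size=(1024, 3))   # a toy point cloud
W = rng.normal(size=(3, 64)) * 0.1    # shared (student/teacher) weights

# Two branches see the same cloud under different masking ratios;
# minimizing the loss pulls their pooled representations together.
z_low = toy_encoder(mask_points(points, 0.4, rng), W)
z_high = toy_encoder(mask_points(points, 0.8, rng), W)
loss = alignment_loss(z_low, z_high)
```

In practice the two branches would be the student and an EMA teacher, and the same alignment principle is applied a second time across MFT time steps; the sketch only shows the masking-ratio axis.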