Point-SRA: Self-Representation Alignment for 3D Representation Learning

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses limitations of existing 3D masked-autoencoding methods: fixed masking ratios that ignore multi-scale representation dependencies and the geometric diversity of point clouds, and oversimplified point-to-point reconstruction assumptions that struggle with complex structures. The proposed self-distillation framework uses variable masking ratios together with a dual-level self-representation alignment mechanism that aligns complementary geometric semantics across different masking ratios and temporal steps within the MAE and MeanFlow Transformer architectures. A flow-conditioned fine-tuning module additionally enables diverse probabilistic reconstruction. The method achieves state-of-the-art performance, surpassing Point-MAE by 5.37% on ScanObjectNN, attaining mIoU scores of 96.07% (arteries) and 86.87% (aneurysms) in intracranial aneurysm segmentation, and reaching 47.3% AP@50 in 3D object detection, outperforming MaskPoint by 5.12%.

📝 Abstract
Masked autoencoders (MAE) have become a dominant paradigm in 3D representation learning, setting new performance benchmarks across various downstream tasks. Existing methods with a fixed mask ratio neglect multi-level representational correlations and intrinsic geometric structures, while relying on point-wise reconstruction assumptions that conflict with the diversity of point clouds. To address these issues, we propose a 3D representation learning method, termed Point-SRA, which aligns representations through self-distillation and probabilistic modeling. Specifically, we assign different masking ratios to the MAE to capture complementary geometric and semantic information, while the MeanFlow Transformer (MFT) leverages cross-modal conditional embeddings to enable diverse probabilistic reconstruction. Our analysis further reveals that representations at different time steps in the MFT also exhibit complementarity. Therefore, a Dual Self-Representation Alignment mechanism is proposed at both the MAE and MFT levels. Finally, we design a Flow-Conditioned Fine-Tuning Architecture to fully exploit the point cloud distribution learned via MeanFlow. Point-SRA outperforms Point-MAE by 5.37% on ScanObjectNN. On intracranial aneurysm segmentation, it reaches 96.07% mean IoU for arteries and 86.87% for aneurysms. For 3D object detection, Point-SRA achieves 47.3% AP@50, surpassing MaskPoint by 5.12%.
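The abstract describes assigning different masking ratios to the MAE and aligning the resulting representations. A minimal sketch of that idea, not the paper's actual implementation: the masking ratios, the mean-pooled toy "encoder", and the cosine-based alignment loss are all illustrative assumptions.

```python
import numpy as np

def random_mask(num_patches, ratio, rng):
    """Randomly mask a fraction `ratio` of point-cloud patches.
    Returns a boolean array: True = visible, False = masked."""
    num_masked = int(num_patches * ratio)
    perm = rng.permutation(num_patches)
    visible = np.ones(num_patches, dtype=bool)
    visible[perm[:num_masked]] = False
    return visible

def alignment_loss(z_a, z_b):
    """Negative-cosine-style alignment: 0 when the two pooled
    representations point the same way, up to 2 when opposed."""
    z_a = z_a / np.linalg.norm(z_a)
    z_b = z_b / np.linalg.norm(z_b)
    return 1.0 - float(z_a @ z_b)

rng = np.random.default_rng(0)
patches = rng.normal(size=(64, 32))  # 64 patches, 32-dim toy features

# Two views of the same cloud with different masking ratios (illustrative values)
vis_lo = random_mask(64, 0.45, rng)
vis_hi = random_mask(64, 0.75, rng)

# Toy stand-in for the encoder: mean-pool the visible patch features
z_lo = patches[vis_lo].mean(axis=0)
z_hi = patches[vis_hi].mean(axis=0)

loss = alignment_loss(z_lo, z_hi)
```

In the paper this alignment is applied via self-distillation between MAE branches (and, analogously, between MFT time steps); the sketch above only shows the shape of the two-ratio masking and the alignment objective.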
Problem

Research questions and friction points this paper is trying to address.

3D representation learning
masked autoencoders
point cloud
representation alignment
geometric structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Representation Alignment
Masked Autoencoder
Probabilistic Reconstruction
MeanFlow Transformer
Variable Masking Ratio
👥 Authors
Lintong Wei (School of Electronics and Information, Xi'an Polytechnic University)
Jian Lu (Shenzhen University; Signal processing, Image processing, Machine Learning)
Haozhe Cheng (Xi'an Jiaotong University; 3D vision, Deep learning)
Jihua Zhu (School of Software, Xi'an Jiaotong University)
Kaibing Zhang (School of Computer Science, Xi'an Polytechnic University)