Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception

📅 2026-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of 4D point cloud video data, which hinders the scalability of self-supervised models, and the performance degradation caused by directly transferring 3D pre-trained models to 4D tasks due to modality gaps and overfitting. To overcome these challenges, the authors propose PointATA, a two-stage “Align-then-Adapt” paradigm. First, optimal transport theory is employed to align the distributions of 3D and 4D data, mitigating inter-modal discrepancies. Subsequently, a lightweight point video adapter is introduced to enhance temporal modeling while keeping the pre-trained 3D backbone frozen. This approach uniquely decouples parameter-efficient transfer learning into alignment and adaptation stages. PointATA achieves or surpasses full fine-tuning performance across multiple 4D tasks with remarkable parameter efficiency: 97.21% accuracy on 3D action recognition, an 8.7% improvement on 4D action segmentation, and 84.06% on 4D semantic segmentation.
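The second stage described above, a small trainable adapter attached to a frozen pre-trained backbone, is a standard parameter-efficient pattern. As a rough illustration only (not PointATA's actual adapter; all dimensions, initializations, and the tanh backbone layer are invented for the sketch), the following NumPy snippet shows a residual bottleneck adapter whose up-projection is zero-initialized, so training starts from the frozen model's behavior, and compares its parameter count against the backbone layer's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the paper.
d_model, d_bottleneck = 256, 16

# "Frozen" backbone layer: these weights would stay fixed in Stage 2.
W_backbone = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

# Lightweight adapter: residual bottleneck (down-project, ReLU, up-project).
W_down = rng.normal(size=(d_bottleneck, d_model)) * 0.01
W_up = np.zeros((d_model, d_bottleneck))  # zero-init: adapter starts as identity

def frozen_layer(x):
    return np.tanh(W_backbone @ x)

def adapter(x):
    # Residual connection keeps the frozen features intact.
    return x + W_up @ np.maximum(W_down @ x, 0.0)

x = rng.normal(size=d_model)        # one token feature vector
y = adapter(frozen_layer(x))

# With zero-initialized W_up, the adapted output equals the frozen output.
assert np.allclose(y, frozen_layer(x))

# Parameter efficiency: only the adapter's weights would be trained.
backbone_params = W_backbone.size
adapter_params = W_down.size + W_up.size
print(f"trainable fraction: {adapter_params / (backbone_params + adapter_params):.3%}")
```

With these toy sizes the adapter accounts for roughly a tenth of the combined parameters; in practice the frozen backbone dwarfs the adapter by a much larger margin.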

📝 Abstract
Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We recognize that 4D datasets are far scarcer than 3D ones, which hampers the scalability of self-supervised 4D models. A promising alternative is to transfer 3D pre-trained models to 4D perception tasks. However, rigorous empirical analysis reveals two critical limitations that impede transfer capability: overfitting and the modality gap. To overcome these challenges, we develop a novel "Align then Adapt" (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, enabling our proposed point align embedder to be trained in Stage 1 to alleviate the underlying modality gap. To mitigate overfitting, an efficient point-video adapter and a spatial-context encoder are integrated into the frozen 3D backbone to enhance temporal modeling capacity in Stage 2. Notably, with the above engineering-oriented designs, PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost compared to previous work. Extensive experiments show that PointATA can match or even outperform strong full fine-tuning models, whilst enjoying the advantage of parameter efficiency, e.g., 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation.
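Stage 1 uses optimal-transport theory to quantify the distributional discrepancy between 3D and 4D features. As a rough illustration of that idea (not the paper's implementation; the feature sets, dimensions, and regularization value are all synthetic), the sketch below computes an entropy-regularized Sinkhorn cost between two toy point sets, where a constant shift stands in for the modality gap, and shows that removing the shift, a stand-in for a perfect "align" step, lowers the transport cost:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for 3D and 4D feature sets (real ones would come from a backbone).
feats_3d = rng.normal(size=(30, 2))
feats_4d = rng.normal(size=(40, 2)) + 1.5  # constant shift = synthetic "modality gap"

def sinkhorn_cost(a_pts, b_pts, reg=0.5, n_iter=200):
    """Entropy-regularized OT cost between two empirical distributions."""
    # Pairwise squared Euclidean cost matrix.
    C = np.linalg.norm(a_pts[:, None, :] - b_pts[None, :, :], axis=-1) ** 2
    a = np.full(len(a_pts), 1.0 / len(a_pts))  # uniform source weights
    b = np.full(len(b_pts), 1.0 / len(b_pts))  # uniform target weights
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):           # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # transport plan
    return float(np.sum(P * C))       # <P, C>: transport cost

gap = sinkhorn_cost(feats_3d, feats_4d)
aligned = sinkhorn_cost(feats_3d, feats_4d - 1.5)  # after undoing the shift
print(f"before alignment: {gap:.3f}, after alignment: {aligned:.3f}")
assert aligned < gap
```

In the paper's setting, the quantity playing the role of `gap` would drive the training of the point align embedder, so that 4D features land where the frozen 3D backbone expects them.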
Problem

Research questions and friction points this paper is trying to address.

4D perception
parameter-efficient transfer learning
modality gap
overfitting
point cloud video
Innovation

Methods, ideas, or system contributions that make the work stand out.

parameter-efficient transfer learning
4D perception
optimal transport
modality gap
point cloud video
Yiding Sun
Renmin University of China
Large Language Models · Explainable Recommendation
Jihua Zhu
School of Software Engineering, Xi’an Jiaotong University, Xi’an 710048, China; State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University, Xi’an 710048, China
Haozhe Cheng
Xi'an Jiaotong University
3D vision · Deep learning
Chaoyi Lu
Zhongguancun Laboratory
network security · internet measurement
Zhichuan Yang
School of Software Engineering, Xi’an Jiaotong University, Xi’an 710048, China
Lin Chen
School of Software Engineering, Xi’an Jiaotong University, Xi’an 710048, China
Yaonan Wang
School of Electrical and Information Engineering, Hunan University, Changsha 410082, China; National Engineering Research Center for Robot Visual Perception and Control Technology, Changsha 410082, China