🤖 AI Summary
This work addresses the limited representational capacity and generalization performance of 3D foundation models in few-shot point cloud learning by proposing PointRFT, the first explicit reinforcement-based fine-tuning paradigm for this setting. PointRFT introduces reinforcement learning into few-shot adaptation of 3D point clouds, employing a dual-reward mechanism that jointly optimizes accuracy and prediction diversity. It leverages Group Relative Policy Optimization (GRPO) to enable end-to-end refinement of mainstream 3D foundation models and establishes a hybrid training framework integrating pretraining, supervised fine-tuning (SFT), and reinforcement fine-tuning (RFT). Extensive experiments demonstrate that PointRFT significantly outperforms conventional supervised fine-tuning across multiple few-shot classification benchmarks, achieving state-of-the-art performance under data-scarce conditions.
📝 Abstract
Understanding spatial dynamics and semantics in point clouds is fundamental for comprehensive 3D understanding. While reinforcement learning (RL) algorithms such as Group Relative Policy Optimization (GRPO) have recently achieved remarkable breakthroughs in large language models by incentivizing reasoning capabilities through strategic reward design, their potential remains largely unexplored in the 3D perception domain. This naturally raises a pivotal question: Can RL-based methods effectively empower 3D point cloud fine-tuning? In this paper, we propose PointRFT, the first reinforcement fine-tuning paradigm tailored specifically for point cloud representation learning. We select three prevalent 3D foundation models and devise specialized accuracy-reward and dispersion-reward functions to stabilize training and mitigate distribution shifts. Through comprehensive few-shot classification experiments comparing distinct training paradigms, we demonstrate that PointRFT consistently outperforms vanilla supervised fine-tuning (SFT) across diverse benchmarks. Furthermore, when PointRFT is organically integrated into a hybrid Pretraining-SFT-RFT paradigm, the representational capacity of point cloud foundation models is substantially unleashed, achieving state-of-the-art performance particularly under data-scarce scenarios.
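To make the GRPO-style dual-reward idea concrete, the following is a minimal, hypothetical sketch of how group-relative advantages could be computed from an accuracy reward plus an entropy-based dispersion term. The specific reward forms, the entropy choice for dispersion, and the weight `lam` are illustrative assumptions, not the paper's actual implementation; only the group-relative normalization follows standard GRPO.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward
    against the mean and std of its own group (GRPO-style)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def accuracy_reward(pred_labels, true_label):
    """1.0 for a correct class prediction, 0.0 otherwise."""
    return np.array([1.0 if p == true_label else 0.0 for p in pred_labels])

def dispersion_reward(prob_dists):
    """Entropy of each predicted class distribution, encouraging
    diverse (non-collapsed) predictions across the group.
    (Hypothetical stand-in for the paper's dispersion reward.)"""
    p = np.clip(np.asarray(prob_dists, dtype=np.float64), 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

# Toy group of G=4 rollouts for a single point cloud sample
preds = [2, 2, 1, 2]               # sampled class predictions
true = 2
probs = [[0.10, 0.20, 0.70],       # per-rollout class distributions
         [0.05, 0.05, 0.90],
         [0.20, 0.60, 0.20],
         [0.30, 0.30, 0.40]]

lam = 0.1                          # dispersion weight (assumed value)
rewards = accuracy_reward(preds, true) + lam * dispersion_reward(probs)
adv = grpo_advantages(rewards)     # weights for the policy-gradient loss
print(adv.round(3))
```

The incorrect rollout (index 2) receives the lowest group-relative advantage, so the policy update pushes probability mass away from it, while the dispersion term keeps correct rollouts from collapsing to identical low-entropy outputs.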