🤖 AI Summary
Deploying large 3D foundation models on resource-constrained edge devices remains challenging, and existing compression methods often degrade their cross-task generalization. Method: This paper proposes Foundation Model Distillation (FMD), a paradigm for compressing self-supervised foundation models into compact, faithful proxies, and presents Foundry, the first FMD framework for 3D point clouds. It employs token-level knowledge distillation to guide a lightweight student model in learning a compact basis of the teacher's self-supervised representation space and generating reconstructive "SuperTokens" that faithfully recover the teacher's token features. Contribution/Results: FMD reduces token count by over 80% and significantly lowers computational overhead while preserving the teacher's general-purpose representation capability across downstream tasks, including classification, part segmentation, and few-shot transfer. Experiments show that a single distilled student model approaches teacher performance and is deployable on edge devices such as resource-limited robots and AR/VR systems, establishing the first efficient, general-purpose distillation solution for edge-deployable 3D foundation models.
📝 Abstract
Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient 'specialist' models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable. In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Foundry trains a student to learn a compressed set of SuperTokens that reconstruct the teacher's token-level representations, capturing a compact basis of its latent space. A single distilled model maintains strong transferability across diverse downstream tasks (classification, part segmentation, and few-shot scenarios), approaching full foundation-model performance while using significantly fewer tokens and FLOPs, making such models more practical for deployment on resource-constrained hardware.
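The core idea, a student that emits a small set of SuperTokens whose recombinations reconstruct the teacher's token features, can be sketched numerically. This is a minimal illustration, not the paper's implementation: the names `T`, `S`, and `W`, the softmax mixing matrix, and the plain MSE objective are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, d = 64, 8, 32  # teacher tokens, SuperTokens (K << N), feature dim

# Frozen teacher token features (stand-in for an SSL backbone's output).
T = rng.standard_normal((N, d))

# Hypothetical student outputs: K SuperTokens plus a per-token mixing
# matrix W (N x K) whose rows recombine SuperTokens to approximate
# each of the teacher's N token features.
S = rng.standard_normal((K, d))
logits = rng.standard_normal((N, K))
W = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row-softmax

T_hat = W @ S                        # reconstructed teacher tokens (N x d)
loss = np.mean((T_hat - T) ** 2)     # token-level distillation loss to minimize

print(f"tokens kept: {K}/{N} ({1 - K / N:.0%} reduction), loss={loss:.3f}")
```

With K=8 SuperTokens standing in for N=64 teacher tokens, the sketch mirrors the claimed >80% token reduction; training would minimize `loss` over the student's parameters so the SuperTokens span a compact basis of the teacher's latent space.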