LensDFF: Language-enhanced Sparse Feature Distillation for Efficient Few-Shot Dexterous Manipulation

📅 2025-03-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing few-shot dexterous robotic manipulation approaches suffer from high computational overhead, reliance on multi-view inputs and NeRF/Gaussian Splatting, lengthy training times, and insufficient dexterity in grasping. To address these challenges, this paper proposes LensDFF, a language-enhanced sparse feature distillation framework. LensDFF is the first method to efficiently and view-consistently distill semantic features from 2D vision foundation models (e.g., CLIP) into 3D point clouds under single-view observation. It introduces language-guided sparse feature fusion and grasp primitive embedding, and establishes a real-to-sim evaluation pipeline for rapid hyperparameter tuning. Extensive experiments in simulation and on real robots demonstrate that LensDFF significantly outperforms state-of-the-art methods: it improves grasp success rate and dexterity, accelerates inference by 3.2×, and reduces training cost by 67%.

Technology Category

Application Category

📝 Abstract
Learning dexterous manipulation from few-shot demonstrations is a significant yet challenging problem for advanced, human-like robotic systems. Dense distilled feature fields have addressed this challenge by distilling rich semantic features from 2D visual foundation models into the 3D domain. However, their reliance on neural rendering models such as Neural Radiance Fields (NeRF) or Gaussian Splatting results in high computational costs. In contrast, previous approaches based on sparse feature fields either suffer from inefficiencies due to multi-view dependencies and extensive training or lack sufficient grasp dexterity. To overcome these limitations, we propose Language-ENhanced Sparse Distilled Feature Field (LensDFF), which efficiently distills view-consistent 2D features onto 3D points using our novel language-enhanced feature fusion strategy, thereby enabling single-view few-shot generalization. Based on LensDFF, we further introduce a few-shot dexterous manipulation framework that integrates grasp primitives into the demonstrations to generate stable and highly dexterous grasps. Moreover, we present a real2sim grasp evaluation pipeline for efficient grasp assessment and hyperparameter tuning. Through extensive simulation experiments based on the real2sim pipeline and real-world experiments, our approach achieves competitive grasping performance, outperforming state-of-the-art approaches.
Problem

Research questions and friction points this paper is trying to address.

Efficient few-shot dexterous manipulation learning
Overcoming high computational costs in feature distillation
Enhancing grasp dexterity with language-enhanced feature fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-enhanced sparse feature fusion strategy
Single-view few-shot generalization capability
Real2sim grasp evaluation pipeline
🔎 Similar Papers
No similar papers found.
Qian Feng
Qian Feng
Palo Alto Networks
AI Driven threat detectionFuzzingProgram Analysis
D
David S. Martinez Lema
Agile Robots SE, TUM School of Information Computation and Technology, Technical University of Munich
Jianxiang Feng
Jianxiang Feng
TUM-Technical University of Munich / Agile Robots
Probabilistic RoboticsUncertainty QuantificationPerception and Manipulation
Z
Zhaopeng Chen
Agile Robots SE, TUM School of Information Computation and Technology, Technical University of Munich
Alois Knoll
Alois Knoll
Technische Universität München
RoboticsAISensor Data FusionAutonomous DrivingCyber Physical Systems