🤖 AI Summary
This work addresses the challenge of generating physically feasible manipulation sequences for dexterous hands from open-vocabulary language instructions, overcoming the limitations of conventional approaches that rely on object-centric cues or predefined interaction sequences. The authors propose the first unified framework for dexterous manipulation that supports open-vocabulary instructions, leveraging a vision-language-action model trained on human-object interaction data and integrating a physics-guided dynamic optimization module to produce smooth, executable motion trajectories. Key innovations include a unified hand morphology encoder enabling cross-morphology generalization and a novel training paradigm that uses only human demonstration data, without requiring teleoperation. Experiments demonstrate state-of-the-art performance across multiple datasets and real-world scenarios, with strong results on both seen and unseen objects and trajectories while maintaining high physical plausibility.
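To make the shared-codebook idea behind the morphology encoder concrete, here is a minimal PyTorch sketch. It assumes a VQ-style nearest-neighbor quantizer with one lightweight projection head per hand morphology; the class name `UnifiedHandTokenizer`, the DoF counts, and all hyperparameters are illustrative assumptions, not the authors' actual design.

```python
# Hypothetical sketch of a shared-codebook hand tokenizer (VQ-style).
import torch
import torch.nn as nn

class UnifiedHandTokenizer(nn.Module):
    """Maps joint states from hands with different DoF counts into a
    single shared discrete codebook (illustrative design, not the paper's API)."""

    def __init__(self, dof_per_hand: dict, latent_dim: int = 128,
                 codebook_size: int = 512):
        super().__init__()
        # One projection per morphology; the codebook itself is shared.
        self.encoders = nn.ModuleDict({
            name: nn.Linear(dof, latent_dim)
            for name, dof in dof_per_hand.items()
        })
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def forward(self, hand_name: str, joints: torch.Tensor) -> torch.Tensor:
        z = self.encoders[hand_name](joints)          # (T, latent_dim)
        # Nearest-neighbor lookup into the shared codebook.
        dists = torch.cdist(z, self.codebook.weight)  # (T, codebook_size)
        return dists.argmin(dim=-1)                   # discrete token ids

# Example: two hands with different DoF counts share one token space.
tokenizer = UnifiedHandTokenizer({"shadow": 24, "allegro": 16})
tokens = tokenizer("allegro", torch.randn(10, 16))   # 10 timesteps -> 10 ids
```

Because only the per-hand projections differ, adding a new morphology amounts to adding one encoder head while the downstream token vocabulary stays fixed, which is what makes cross-morphology generalization plausible in this framing.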
📝 Abstract
Planning physically feasible dexterous hand manipulation is a central challenge in robotic manipulation and Embodied AI. Prior work typically relies on object-centric cues or precise hand-object interaction sequences, forgoing the rich, compositional guidance of open-vocabulary instruction. We introduce UniHM, the first framework for unified dexterous hand manipulation guided by free-form language commands. We propose a Unified Hand-Dexterous Tokenizer that maps heterogeneous dexterous-hand morphologies into a single shared codebook, improving cross-morphology generalization and scalability to new hand designs. Our vision-language-action model is trained solely on human-object interaction data, eliminating the need for massive real-world teleoperation datasets, and demonstrates strong generalizability in producing human-like manipulation sequences from open-ended language instructions. To ensure physical realism, we introduce a physics-guided dynamic refinement module that performs segment-wise joint optimization under generative and temporal priors, yielding smooth and physically feasible manipulation sequences. Across multiple datasets and real-world evaluations, UniHM attains state-of-the-art results on both seen and unseen objects and trajectories, demonstrating strong generalization and high physical feasibility. Our project page is available at https://unihm.github.io/.
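To illustrate what segment-wise refinement under generative and temporal priors can look like, the sketch below optimizes one trajectory segment to stay close to the generated motion (a stand-in for the generative prior) while penalizing finite-difference acceleration (a temporal prior). The paper's actual physics terms, such as contact or penetration constraints, are omitted; `refine_segment`, its loss weights, and the optimizer settings are hypothetical.

```python
# Hedged sketch of segment-wise trajectory refinement; the physics
# constraints described in the paper are not modeled here.
import torch

def refine_segment(generated: torch.Tensor, steps: int = 200,
                   w_prior: float = 1.0, w_smooth: float = 10.0) -> torch.Tensor:
    """generated: (T, D) joint-angle trajectory for one segment."""
    traj = generated.clone().requires_grad_(True)
    opt = torch.optim.Adam([traj], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        # Generative prior: stay close to the generated motion.
        prior = ((traj - generated) ** 2).mean()
        # Temporal prior: penalize finite-difference acceleration.
        accel = traj[2:] - 2 * traj[1:-1] + traj[:-2]
        smooth = (accel ** 2).mean()
        loss = w_prior * prior + w_smooth * smooth
        loss.backward()
        opt.step()
    return traj.detach()

# Example: refine a 50-frame segment for a 24-DoF hand.
refined = refine_segment(torch.randn(50, 24))
```

Running the optimization per segment rather than over the whole sequence keeps each problem small and local, which is consistent with the segment-wise joint optimization the abstract describes.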