🤖 AI Summary
This work addresses key limitations in vision-based robotic hand control, namely reliance on explicit pose estimation, depth sensing, and large-scale annotated datasets, by proposing an end-to-end framework that maps 2D images directly to joint angles. Methodologically, it eliminates dedicated depth and pose-estimation modules and trains exclusively on synthetically generated, randomized joint configurations. Inspired by vision-language models, it tokenizes the input image and feeds the tokens to a Transformer decoder that regresses continuous joint angles. The core contribution is the first demonstration of zero-shot cross-morphology transfer, from robotic hands to human hands, without any real-world annotations, depth maps, or pose priors. Evaluated on physical hardware, the approach achieves competitive accuracy under real-world conditions while maintaining low inference latency and strong robustness to occlusion, lighting variation, and viewpoint changes.
📝 Abstract
This paper introduces PoseLess, a novel framework for robot hand control that eliminates the need for explicit pose estimation by directly mapping 2D images to joint angles using tokenized representations. Our approach leverages synthetic training data generated through randomized joint configurations, enabling zero-shot generalization to real-world scenarios and cross-morphology transfer from robotic to human hands. By tokenizing visual inputs and employing a transformer-based decoder, PoseLess achieves robust, low-latency control while addressing challenges such as depth ambiguity and data scarcity. Experimental results demonstrate competitive performance in joint angle prediction accuracy without relying on any human-labelled dataset.
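The pipeline the abstract describes (tokenize the visual input, then decode continuous joint angles with an attention-based model) can be sketched in miniature. Everything below is an illustrative assumption, not the paper's actual architecture: the patch-based tokenizer, the single-head attention block, and all dimensions and names (`patchify`, `TinyDecoder`) are hypothetical stand-ins chosen to show the shape of a direct image-to-joint-angle regressor.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(image, patch=8):
    """Tokenize an HxW image as flattened non-overlapping patches."""
    H, W = image.shape
    return (image.reshape(H // patch, patch, W // patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, patch * patch))  # (num_tokens, patch*patch)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TinyDecoder:
    """One self-attention block plus a linear head that regresses
    continuous joint angles from image tokens (untrained weights)."""
    def __init__(self, d_in, d_model, n_joints):
        init = lambda *s: rng.normal(0.0, 0.02, s)
        self.W_embed = init(d_in, d_model)
        self.W_q, self.W_k, self.W_v = (init(d_model, d_model) for _ in range(3))
        self.W_out = init(d_model, n_joints)

    def __call__(self, tokens):
        x = tokens @ self.W_embed                        # (T, d_model)
        q, k, v = x @ self.W_q, x @ self.W_k, x @ self.W_v
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))   # scaled dot-product
        x = x + attn @ v                                 # residual attention
        return x.mean(axis=0) @ self.W_out               # pooled -> joint angles

# End-to-end: 2D image in, one angle per joint out, no pose estimate in between.
image = rng.random((64, 64))
angles = TinyDecoder(d_in=64, d_model=32, n_joints=16)(patchify(image))
```

Training such a model on synthetic data, as the abstract suggests, would pair randomly sampled joint configurations with their rendered images, so no human-labelled dataset is ever required.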