PoseLess: Depth-Free Vision-to-Joint Control via Direct Image Mapping with VLM

📅 2025-03-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses key limitations in vision-based robotic hand control—namely, reliance on explicit pose estimation, depth sensing, and large-scale annotated datasets—by proposing an end-to-end framework that directly maps 2D images to joint angles. Methodologically, it eliminates dedicated depth or pose estimation modules and trains exclusively on synthetically generated, randomized joint configurations. Inspired by vision-language models, it employs image tokenization followed by a Transformer decoder for continuous joint-angle regression. The core contribution is the first demonstration of zero-shot cross-morphology transfer—from robotic hands to human hands—without any real-world annotations, depth maps, or pose priors. Evaluated on physical hardware, the approach achieves competitive accuracy under real-world conditions, while maintaining low inference latency and strong robustness to occlusion, lighting variation, and viewpoint changes.
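The pipeline the summary describes — tokenizing the 2D image into patches and regressing continuous joint angles with an attention-based decoder — can be sketched in miniature as follows. This is an illustrative NumPy sketch with made-up dimensions (`patchify`, `attention_pool`, and all parameter shapes are assumptions); the paper's actual VLM backbone and decoder are not reproduced here.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxW image into flattened patch tokens (ViT-style)."""
    h, w = image.shape
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append(image[i:i + patch, j:j + patch].ravel())
    return np.stack(tokens)            # (num_tokens, patch*patch)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(tokens, w_q, w_k, w_v):
    """Single-head attention of one learned query over the patch tokens."""
    q = w_q                            # (1, d) learned query
    k = tokens @ w_k                   # (n, d)
    v = tokens @ w_v                   # (n, d)
    scores = softmax(q @ k.T / np.sqrt(k.shape[1]))
    return scores @ v                  # (1, d)

def predict_joint_angles(image, params):
    """Map a raw 2D image directly to a joint-angle vector."""
    tokens = patchify(image) @ params["embed"]   # project patches to d dims
    pooled = attention_pool(tokens, params["w_q"],
                            params["w_k"], params["w_v"])
    raw = pooled @ params["head"]                # (1, n_joints)
    return np.tanh(raw) * np.pi                  # bound angles to [-pi, pi]

rng = np.random.default_rng(0)
d = 32
params = {
    "embed": rng.normal(size=(16 * 16, d)) * 0.02,
    "w_q":   rng.normal(size=(1, d)) * 0.02,
    "w_k":   rng.normal(size=(d, d)) * 0.02,
    "w_v":   rng.normal(size=(d, d)) * 0.02,
    "head":  rng.normal(size=(d, 16)) * 0.02,    # 16 joints, illustrative
}
angles = predict_joint_angles(rng.normal(size=(64, 64)), params)
print(angles.shape)  # (1, 16)
```

Note how no depth map or intermediate pose appears anywhere: the only supervision target during training would be the joint-angle vector itself.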

📝 Abstract
This paper introduces PoseLess, a novel framework for robot hand control that eliminates the need for explicit pose estimation by directly mapping 2D images to joint angles using tokenized representations. Our approach leverages synthetic training data generated through randomized joint configurations, enabling zero-shot generalization to real-world scenarios and cross-morphology transfer from robotic to human hands. By tokenizing visual inputs and employing a transformer-based decoder, PoseLess achieves robust, low-latency control while addressing challenges such as depth ambiguity and data scarcity. Experimental results demonstrate competitive performance in joint angle prediction accuracy without relying on any human-labelled dataset.
Problem

Research questions and friction points this paper is trying to address.

Mapping 2D images directly to joint angles without explicit pose estimation.
Zero-shot generalization to real-world scenes and cross-morphology transfer.
Robust, low-latency control from tokenized visual inputs and a transformer decoder.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct 2D image-to-joint-angle mapping
Synthetic training data for zero-shot generalization
Transformer-based decoder for low-latency control
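The synthetic-data recipe credited above for zero-shot generalization — training exclusively on randomized joint configurations — might look like this in outline. The joint limits and the renderer below are placeholders (the paper's simulator and hand model are not specified here); the point is that every (image, label) pair is generated, never human-annotated.

```python
import numpy as np

# Hypothetical per-joint limits in radians for a 16-DoF hand (illustrative only).
JOINT_LIMITS = np.array([[-0.5, 1.6]] * 16)

def sample_joint_configuration(rng):
    """Draw one joint-angle vector uniformly within the limits."""
    lo, hi = JOINT_LIMITS[:, 0], JOINT_LIMITS[:, 1]
    return rng.uniform(lo, hi)

def render_hand(angles, rng, size=64):
    """Stand-in for the simulator's renderer: returns a dummy image.
    In the actual pipeline this would rasterize the hand posed at `angles`,
    with randomized lighting and viewpoint for robustness."""
    return rng.normal(size=(size, size))

def make_dataset(n, seed=0):
    """Generate (image, joint-angle) training pairs with no human labels."""
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(n):
        angles = sample_joint_configuration(rng)
        pairs.append((render_hand(angles, rng), angles))
    return pairs

data = make_dataset(100)
print(len(data), data[0][1].shape)  # 100 (16,)
```

Because the labels are the sampled angles themselves, dataset size is limited only by rendering budget, which is what makes the no-annotation training regime possible.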
🔎 Similar Papers
2024-05-28 · International Conference on Learning Representations · Citations: 10