🤖 AI Summary
This work addresses key limitations in vision-based robotic hand control, namely reliance on explicit pose estimation, depth sensing, and large-scale annotated datasets, by proposing an end-to-end framework that maps 2D images directly to joint angles. Methodologically, it eliminates dedicated depth and pose-estimation modules and trains exclusively on synthetically generated, randomized joint configurations. Inspired by vision-language models, it tokenizes the input image and feeds the tokens to a Transformer decoder that regresses continuous joint angles. The core contribution is the first demonstration of zero-shot cross-morphology transfer, from robotic hands to human hands, without any real-world annotations, depth maps, or pose priors. Evaluated on physical hardware, the approach achieves competitive accuracy under real-world conditions while maintaining low inference latency and strong robustness to occlusion, lighting variation, and viewpoint changes.
📝 Abstract
This paper introduces PoseLess, a novel framework for robot hand control that eliminates the need for explicit pose estimation by directly mapping 2D images to joint angles using tokenized representations. Our approach leverages synthetic training data generated through randomized joint configurations, enabling zero-shot generalization to real-world scenarios and cross-morphology transfer from robotic to human hands. By tokenizing visual inputs and employing a transformer-based decoder, PoseLess achieves robust, low-latency control while addressing challenges such as depth ambiguity and data scarcity. Experimental results demonstrate competitive performance in joint angle prediction accuracy without relying on any human-labelled dataset.
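The pipeline the abstract describes (tokenize the visual input, then decode continuous joint angles with an attention-based model) can be sketched in miniature. Everything below is an illustrative assumption, not the paper's actual architecture: the patch-based tokenizer, the single-head attention block, and all dimensions and names (`patchify`, `TinyDecoder`) are hypothetical stand-ins chosen to show the shape of a direct image-to-joint-angle regressor.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(image, patch=8):
    """Tokenize an HxW image as flattened non-overlapping patches."""
    H, W = image.shape
    return (image.reshape(H // patch, patch, W // patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, patch * patch))  # (num_tokens, patch*patch)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TinyDecoder:
    """One self-attention block plus a linear head that regresses
    continuous joint angles from image tokens (untrained weights)."""
    def __init__(self, d_in, d_model, n_joints):
        init = lambda *s: rng.normal(0.0, 0.02, s)
        self.W_embed = init(d_in, d_model)
        self.W_q, self.W_k, self.W_v = (init(d_model, d_model) for _ in range(3))
        self.W_out = init(d_model, n_joints)

    def __call__(self, tokens):
        x = tokens @ self.W_embed                        # (T, d_model)
        q, k, v = x @ self.W_q, x @ self.W_k, x @ self.W_v
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))   # scaled dot-product
        x = x + attn @ v                                 # residual attention
        return x.mean(axis=0) @ self.W_out               # pooled -> joint angles

# End-to-end: 2D image in, one angle per joint out, no pose estimate in between.
image = rng.random((64, 64))
angles = TinyDecoder(d_in=64, d_model=32, n_joints=16)(patchify(image))
```

Training such a model on synthetic data, as the abstract suggests, would pair randomly sampled joint configurations with their rendered images, so no human-labelled dataset is ever required.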