🤖 AI Summary
This work addresses dexterous human-robot collaboration by proposing a lightweight, language-driven vision-language-action (VLA) system that executes long-horizon cooperative tasks (such as pick-and-pass) from minimal natural language prompts. Methodologically, it adapts the pretrained Open-VLA model with FiLM-based conditioning for task-aware perception; introduces an auxiliary intention head that predicts the collaborator's hand pose; compresses the action space via incremental (delta) action prediction and PCA reduction of the finger joints; and integrates MediaPipe for real-time, multi-view hand pose estimation. Experiments report an end-to-end latency of ~300 ms, with the top four principal components capturing ~96% of finger-joint motion variance. Ablation studies identify action post-processing as the dominant performance factor. The approach improves the VLA model's collaborative generalization and execution fluency with minimal prompting overhead.
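The PCA-based action compression mentioned above can be sketched as follows. This is an illustrative numpy-only reconstruction, not the paper's code: the joint count, synthetic data, and variable names are assumptions; the idea is simply that finger-joint commands are projected onto a few principal components and mapped back to full commands at execution time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake dataset: 500 frames of 16 finger-joint angles with low-rank
# "synergy" structure plus noise (stand-in for real teleop data).
latent = rng.normal(size=(500, 4))
mixing = rng.normal(size=(4, 16))
joints = latent @ mixing + 0.05 * rng.normal(size=(500, 16))

# PCA via SVD of the mean-centered data.
mean = joints.mean(axis=0)
centered = joints - mean
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
explained = (S**2) / (S**2).sum()

k = 4  # the paper reports ~96% of variance in the top 4 components
coeffs = centered @ Vt[:k].T            # compressed action: 16-D -> 4-D
reconstructed = coeffs @ Vt[:k] + mean  # map back to full joint commands

print(f"variance explained by top {k} PCs: {explained[:k].sum():.3f}")
```

In a policy trained this way, the network would predict the `k` PCA coefficients (plus the pose deltas) and the fixed basis `Vt[:k]` would decode them into full finger-joint commands.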
📝 Abstract
We adapt a pre-trained Vision-Language-Action (VLA) model (Open-VLA) for dexterous human-robot collaboration with minimal language prompting. Our approach adds (i) FiLM conditioning to visual backbones for task-aware perception, (ii) an auxiliary intent head that predicts collaborator hand pose and target cues, and (iii) action-space post-processing that predicts compact deltas (position/rotation) and PCA-reduced finger joints before mapping to full commands. Using a multi-view, teleoperated Franka and Mimic-hand dataset augmented with MediaPipe hand poses, we demonstrate that delta actions are well-behaved and that four principal components explain ~96% of hand-joint variance. Ablations identify action post-processing as the primary performance driver; auxiliary intent helps, FiLM is mixed, and a directional motion loss is detrimental. A real-time stack (~0.3 s latency on one RTX 4090) composes "pick-up" and "pass" into a long-horizon behavior. We surface "trainer overfitting" to specific demonstrators as the key limitation.
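The FiLM conditioning in (i) can be summarized with a minimal sketch. Feature-wise Linear Modulation scales and shifts each channel of a visual feature map using parameters generated from the language embedding; the shapes, the single linear generator, and all names below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

C, H, W = 8, 4, 4  # channels / spatial dims of one visual feature map
D = 16             # pooled language-embedding size (assumed)

features = rng.normal(size=(C, H, W))  # output of a visual backbone layer
lang_emb = rng.normal(size=(D,))       # pooled prompt embedding

# FiLM generator: a single linear map from the language embedding to
# per-channel scale (gamma) and shift (beta) parameters.
W_film = 0.1 * rng.normal(size=(2 * C, D))
gamma_beta = W_film @ lang_emb
gamma, beta = gamma_beta[:C], gamma_beta[C:]

# Feature-wise modulation: every channel is scaled and shifted, so the
# same backbone features are re-weighted per task prompt.
modulated = gamma[:, None, None] * features + beta[:, None, None]
print(modulated.shape)
```

Because gamma and beta are broadcast over spatial locations, the prompt modulates *what* the backbone attends to without adding per-pixel parameters, which keeps the conditioning lightweight.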