Robotic Assistant: Completing Collaborative Tasks with Dexterous Vision-Language-Action Models

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses dexterous human-robot collaboration by proposing a lightweight, language-driven vision-language-action (VLA) system that executes long-horizon cooperative operations—such as picking up and passing an object—from minimal natural-language prompts. Methodologically, it adapts the pre-trained Open-VLA model with FiLM-based conditioning for task-aware perception; introduces an auxiliary intent head that predicts the human collaborator's hand pose and target cues; compresses the action space via incremental (delta) action prediction and PCA dimensionality reduction of the finger joints; and integrates MediaPipe for real-time, multi-view hand-pose estimation. Experiments show an end-to-end latency of ~300 ms on a single RTX 4090, with the top four principal components capturing ~96% of finger-joint motion variance. Ablation studies identify action post-processing as the dominant performance factor, with the auxiliary intent head contributing smaller gains. Overall, the approach improves the collaborative generalization and execution fluency of VLA models at low prompting overhead.
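To make the action-space compression concrete, here is a minimal sketch of fitting PCA components over recorded finger-joint vectors and mapping the policy's compact predictions back to full joint commands. The 16-joint hand, array shapes, and function names are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of PCA-based finger-joint compression (assumed shapes:
# joint_angles is an (N, 16) array of joint targets from teleop demos).
import numpy as np
from sklearn.decomposition import PCA

def fit_hand_pca(joint_angles: np.ndarray, n_components: int = 4) -> PCA:
    """Fit a PCA that compresses full finger-joint vectors to a few components."""
    pca = PCA(n_components=n_components)
    pca.fit(joint_angles)
    # The paper reports ~96% of motion variance in the top four components.
    print("explained variance:", pca.explained_variance_ratio_.sum())
    return pca

def encode(pca: PCA, joints: np.ndarray) -> np.ndarray:
    """Project full joint vectors into the compact space the policy predicts."""
    return pca.transform(joints)

def decode(pca: PCA, latent: np.ndarray) -> np.ndarray:
    """Map predicted compact actions back to full joint commands."""
    return pca.inverse_transform(latent)
```

Predicting a handful of components instead of every joint angle shrinks the policy's output head; the decoder then recovers executable full-hand commands.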

📝 Abstract
We adapt a pre-trained Vision-Language-Action (VLA) model (Open-VLA) for dexterous human-robot collaboration with minimal language prompting. Our approach adds (i) FiLM conditioning to visual backbones for task-aware perception, (ii) an auxiliary intent head that predicts collaborator hand pose and target cues, and (iii) action-space post-processing that predicts compact deltas (position/rotation) and PCA-reduced finger joints before mapping to full commands. Using a multi-view, teleoperated Franka and Mimic-hand dataset augmented with MediaPipe hand poses, we demonstrate that delta actions are well-behaved and that four principal components explain ~96% of hand-joint variance. Ablations identify action post-processing as the primary performance driver; auxiliary intent helps, FiLM is mixed, and a directional motion loss is detrimental. A real-time stack (~0.3 s latency on one RTX 4090) composes "pick-up" and "pass" into a long-horizon behavior. We surface "trainer overfitting" to specific demonstrators as the key limitation.
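As a concrete illustration of the MediaPipe step mentioned in the abstract, the sketch below extracts 21 normalized hand landmarks from a single camera frame with MediaPipe's Hands solution. The single-hand setting and per-view usage are assumptions; how the paper fuses the multiple views into one pose estimate is not reproduced here.

```python
# Per-view hand-landmark extraction with MediaPipe Hands (illustrative).
import cv2
import mediapipe as mp
import numpy as np

hands = mp.solutions.hands.Hands(
    static_image_mode=False,   # video mode: tracks hands across frames
    max_num_hands=1,           # one collaborator hand per view (assumed)
    min_detection_confidence=0.5,
)

def hand_landmarks(bgr_frame: np.ndarray):
    """Return a (21, 3) array of normalized landmarks, or None if no hand."""
    result = hands.process(cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None
    lm = result.multi_hand_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in lm], dtype=np.float32)
```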
Problem

Research questions and friction points this paper is trying to address.

Adapting vision-language-action models for dexterous human-robot collaboration
Enhancing task-aware perception and predicting collaborator intent cues
Developing compact action representations for complex robotic manipulation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

FiLM conditioning enables task-aware visual perception (see the sketch after this list)
Auxiliary intent head predicts collaborator hand poses
Action post-processing predicts compact deltas and finger joints
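As referenced in the list above, here is a hedged sketch of FiLM conditioning: a task/language embedding predicts per-channel scale and shift parameters that modulate a visual feature map. Layer sizes and the exact placement inside Open-VLA's visual backbone are illustrative assumptions.

```python
# FiLM (feature-wise linear modulation) sketch in PyTorch; sizes are assumed.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, lang_dim: int, num_channels: int):
        super().__init__()
        # One linear head emits both gamma (scale) and beta (shift).
        self.to_gamma_beta = nn.Linear(lang_dim, 2 * num_channels)

    def forward(self, feat: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) visual features; lang: (B, lang_dim) embedding.
        gamma, beta = self.to_gamma_beta(lang).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]
        beta = beta[:, :, None, None]
        # (1 + gamma) keeps the layer near identity at initialization.
        return (1 + gamma) * feat + beta
```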
Boshi An
Peking University
Robotics · Computational Neuroscience
Chenyu Yang
Soft Robotics Lab, ETHz, Switzerland
Robert Katzschmann
Soft Robotics Lab, ETHz, Switzerland