🤖 AI Summary
This work addresses dexterous human-robot collaboration by proposing a lightweight, language-driven vision-language-action (VLA) system that executes long-horizon cooperative tasks (such as pick-and-pass) from minimal natural language prompts. Methodologically, it adapts the pretrained Open-VLA model with FiLM-based conditioning for task-aware perception; introduces an auxiliary intention head that predicts the collaborator's hand pose; compresses the action space via incremental (delta) action prediction and PCA reduction of the finger joints; and integrates MediaPipe for real-time, multi-view hand pose estimation. Experiments report an end-to-end latency of ~300 ms, with the top four principal components capturing ~96% of finger-joint motion variance. Ablation studies identify action post-processing as the dominant performance factor. The approach improves the VLA model's collaborative generalization and execution fluency with minimal prompting overhead.
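The PCA-based action compression mentioned above can be sketched as follows. This is an illustrative numpy-only reconstruction, not the paper's code: the joint count, synthetic data, and variable names are assumptions; the idea is simply that finger-joint commands are projected onto a few principal components and mapped back to full commands at execution time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake dataset: 500 frames of 16 finger-joint angles with low-rank
# "synergy" structure plus noise (stand-in for real teleop data).
latent = rng.normal(size=(500, 4))
mixing = rng.normal(size=(4, 16))
joints = latent @ mixing + 0.05 * rng.normal(size=(500, 16))

# PCA via SVD of the mean-centered data.
mean = joints.mean(axis=0)
centered = joints - mean
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
explained = (S**2) / (S**2).sum()

k = 4  # the paper reports ~96% of variance in the top 4 components
coeffs = centered @ Vt[:k].T            # compressed action: 16-D -> 4-D
reconstructed = coeffs @ Vt[:k] + mean  # map back to full joint commands

print(f"variance explained by top {k} PCs: {explained[:k].sum():.3f}")
```

In a policy trained this way, the network would predict the `k` PCA coefficients (plus the pose deltas) and the fixed basis `Vt[:k]` would decode them into full finger-joint commands.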
📝 Abstract
We adapt a pre-trained Vision-Language-Action (VLA) model (Open-VLA) for dexterous human-robot collaboration with minimal language prompting. Our approach adds (i) FiLM conditioning to visual backbones for task-aware perception, (ii) an auxiliary intent head that predicts collaborator hand pose and target cues, and (iii) action-space post-processing that predicts compact deltas (position/rotation) and PCA-reduced finger joints before mapping to full commands. Using a multi-view, teleoperated Franka and Mimic-hand dataset augmented with MediaPipe hand poses, we demonstrate that delta actions are well-behaved and that four principal components explain ~96% of hand-joint variance. Ablations identify action post-processing as the primary performance driver; auxiliary intent helps, FiLM is mixed, and a directional motion loss is detrimental. A real-time stack (~0.3 s latency on one RTX 4090) composes "pick-up" and "pass" into a long-horizon behavior. We surface "trainer overfitting" to specific demonstrators as the key limitation.
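The FiLM conditioning in (i) can be summarized with a minimal sketch. Feature-wise Linear Modulation scales and shifts each channel of a visual feature map using parameters generated from the language embedding; the shapes, the single linear generator, and all names below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

C, H, W = 8, 4, 4  # channels / spatial dims of one visual feature map
D = 16             # pooled language-embedding size (assumed)

features = rng.normal(size=(C, H, W))  # output of a visual backbone layer
lang_emb = rng.normal(size=(D,))       # pooled prompt embedding

# FiLM generator: a single linear map from the language embedding to
# per-channel scale (gamma) and shift (beta) parameters.
W_film = 0.1 * rng.normal(size=(2 * C, D))
gamma_beta = W_film @ lang_emb
gamma, beta = gamma_beta[:C], gamma_beta[C:]

# Feature-wise modulation: every channel is scaled and shifted, so the
# same backbone features are re-weighted per task prompt.
modulated = gamma[:, None, None] * features + beta[:, None, None]
print(modulated.shape)
```

Because gamma and beta are broadcast over spatial locations, the prompt modulates *what* the backbone attends to without adding per-pixel parameters, which keeps the conditioning lightweight.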