HumanoidVLM: Vision-Language-Guided Impedance Control for Contact-Rich Humanoid Manipulation

πŸ“… 2026-01-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of adaptively tuning impedance parameters and gripper configurations for contact-rich manipulation on humanoid robots, a process traditionally done by hand. The authors propose a control framework that integrates a vision-language model with retrieval-augmented generation (RAG). By semantically interpreting task instructions, the system retrieves experimentally validated stiffness-damping parameter pairs and object-specific grasp orientations from two custom databases using FAISS, and then drives a task-space impedance controller to achieve compliant manipulation. The approach couples semantic understanding with retrieval-based control, enabling interpretable, adaptive operation without manual tuning. Evaluated across 14 visual scenarios, the method achieves 93% retrieval accuracy, exhibits z-axis tracking errors of 1–3.5 cm in physical trials, and produces virtual force responses consistent with the commanded, task-dependent impedance settings.

πŸ“ Abstract
Humanoid robots must adapt their contact behavior to diverse objects and tasks, yet most controllers rely on fixed, hand-tuned impedance gains and gripper settings. This paper introduces HumanoidVLM, a vision-language-driven retrieval framework that enables the Unitree G1 humanoid to select task-appropriate Cartesian impedance parameters and gripper configurations directly from an egocentric RGB image. The system couples a vision-language model for semantic task inference with a FAISS-based Retrieval-Augmented Generation (RAG) module that retrieves experimentally validated stiffness-damping pairs and object-specific grasp angles from two custom databases, and executes them through a task-space impedance controller for compliant manipulation. We evaluate HumanoidVLM on 14 visual scenarios and achieve a retrieval accuracy of 93%. Real-world experiments show stable interaction dynamics, with z-axis tracking errors typically within 1–3.5 cm and virtual forces consistent with task-dependent impedance settings. These results demonstrate the feasibility of linking semantic perception with retrieval-based control as an interpretable path toward adaptive humanoid manipulation.
Problem

Research questions and friction points this paper is trying to address.

humanoid manipulation
impedance control
contact-rich tasks
adaptive behavior
vision-language guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model
Impedance Control
Retrieval-Augmented Generation
Humanoid Manipulation
Compliant Interaction
πŸ”Ž Similar Papers
No similar papers found.