VLH: Vision-Language-Haptics Foundation Model

📅 2025-08-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of making human-drone interaction more expressive and immersive in aerial robotics and virtual reality through a unified vision-language-touch foundation model. Methodologically, it integrates touch as an active perceptual modality that co-evolves with vision and language, building on the OpenVLA backbone with LoRA fine-tuning and INT8 quantization to map multimodal inputs end-to-end to 7-dimensional action vectors on a high-performance server, which directly drives drone-mounted haptic actuators. The core contribution is a context-aware active haptic synthesis mechanism that enables natural-language-guided tactile output. Experimental evaluation reports a 56.7% target acquisition success rate across 90 flight trials, 100% accuracy in texture discrimination, and up to 70.0% generalization on unseen (visual) tasks.
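
The summary describes the adaptation recipe only at a high level (OpenVLA backbone, LoRA fine-tuning, INT8 quantization). Below is a minimal sketch of how such a setup is commonly assembled with Hugging Face transformers and peft; the checkpoint name, LoRA rank, and target modules are illustrative assumptions, not the paper's reported configuration.

```python
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "openvla/openvla-7b"  # public OpenVLA checkpoint, assumed stand-in for the paper's backbone

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # INT8 weights for memory/latency
    trust_remote_code=True,
)

# Prepare the quantized model for adapter training, then attach low-rank adapters.
model = prepare_model_for_kbit_training(model)
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```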

📝 Abstract
We present VLH, a novel Visual-Language-Haptic Foundation Model that unifies perception, language, and tactile feedback in aerial robotics and virtual reality. Unlike prior work that treats haptics as a secondary, reactive channel, VLH synthesizes mid-air force and vibration cues as a direct consequence of contextual visual understanding and natural language commands. Our platform comprises an 8-inch quadcopter equipped with dual inverse five-bar linkage arrays for localized haptic actuation, an egocentric VR camera, and an exocentric top-down view. Visual inputs and language instructions are processed by a fine-tuned OpenVLA backbone (adapted via LoRA on a bespoke dataset of 450 multimodal scenarios) to output a 7-dimensional action vector (Vx, Vy, Vz, Hx, Hy, Hz, Hv). INT8 quantization and a high-performance server ensure real-time operation at 4-5 Hz. In human-robot interaction experiments (90 flights), VLH achieved a 56.7% success rate for target acquisition (mean reach time 21.3 s, pose error 0.24 m) and 100% accuracy in texture discrimination. Generalization tests yielded 70.0% (visual), 54.4% (motion), 40.0% (physical), and 35.0% (semantic) performance on novel tasks. These results demonstrate VLH's ability to co-evolve haptic feedback with perceptual reasoning and intent, advancing expressive, immersive human-robot interactions.
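
For concreteness, here is a minimal sketch of how the 7-dimensional output (Vx, Vy, Vz, Hx, Hy, Hz, Hv) could be split into a drone velocity setpoint and a haptic actuator command inside a 4-5 Hz loop. The names policy, get_observation, send_velocity, and send_haptics are hypothetical placeholders; the paper's actual control interface is not described in this listing.

```python
import time
from dataclasses import dataclass

@dataclass
class VLHAction:
    vx: float  # drone velocity components (Vx, Vy, Vz)
    vy: float
    vz: float
    hx: float  # haptic actuator targets (Hx, Hy, Hz)
    hy: float
    hz: float
    hv: float  # haptic vibration intensity (Hv)

def decode_action(vec):
    """Split a 7-element model output into flight and haptic commands."""
    if len(vec) != 7:
        raise ValueError("expected a 7-dimensional action vector")
    return VLHAction(*vec)

def control_loop(policy, get_observation, send_velocity, send_haptics, hz=4.5):
    """Query the model and dispatch commands at roughly 4-5 Hz."""
    period = 1.0 / hz
    while True:
        start = time.time()
        obs = get_observation()                # egocentric + exocentric frames and the instruction
        action = decode_action(policy(obs))    # model returns (Vx, Vy, Vz, Hx, Hy, Hz, Hv)
        send_velocity(action.vx, action.vy, action.vz)             # drone velocity setpoint
        send_haptics(action.hx, action.hy, action.hz, action.hv)   # haptic actuator command
        time.sleep(max(0.0, period - (time.time() - start)))
```
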
Problem

Research questions and friction points this paper is trying to address.

Haptics in prior systems is a secondary, reactive channel rather than one that co-evolves with vision and language
How to synthesize mid-air force and vibration cues directly from visual context and natural language commands
How to sustain real-time, expressive human-robot interaction on an aerial platform
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies vision, language, and haptics in a single foundation model for aerial robotics and VR
8-inch quadcopter with dual inverse five-bar linkage arrays for localized mid-air haptic actuation
OpenVLA backbone adapted via LoRA, with INT8 quantization for real-time multimodal inference