🤖 AI Summary
This work tackles the challenge of making human–drone interaction in aerial robotics and virtual reality more expressive and immersive through a unified vision–language–touch foundation model. Methodologically, it pioneers the integration of touch as an active perceptual modality that co-evolves with vision and language. Building on the OpenVLA backbone, it employs LoRA fine-tuning and INT8 quantization to map multimodal inputs end-to-end to 7-DOF action vectors on a high-performance server, directly driving drone-mounted haptic actuators. The core contribution is a context-aware active haptic synthesis mechanism that enables natural-language-guided tactile output. Experimental evaluation demonstrates a 56.7% target-acquisition success rate across 90 flight trials, 100% accuracy in texture recognition, and up to 70.0% generalization performance on unseen (visual) tasks.
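The 7-DOF action vector mentioned above bundles a drone velocity command with a haptic actuation command. A minimal sketch of how such an output might be decoded and bounded before reaching the actuators is shown below; the field names, clipping limits, and sign conventions are illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass

# Assumed per-axis limits for illustration only (not values from the paper).
VEL_LIMIT = 1.0      # m/s, drone velocity bound per axis
HAPTIC_LIMIT = 1.0   # normalized haptic actuation range

@dataclass
class Action:
    vx: float; vy: float; vz: float   # drone velocity components
    hx: float; hy: float; hz: float   # haptic force direction components
    hv: float                         # haptic vibration intensity

def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))

def decode_action(raw: list[float]) -> Action:
    """Map a raw 7-float model output to a bounded action command."""
    assert len(raw) == 7, "expected a 7-DOF action vector"
    v = [clip(x, -VEL_LIMIT, VEL_LIMIT) for x in raw[:3]]
    h = [clip(x, -HAPTIC_LIMIT, HAPTIC_LIMIT) for x in raw[3:6]]
    hv = clip(raw[6], 0.0, HAPTIC_LIMIT)  # vibration assumed non-negative
    return Action(*v, *h, hv)

# Example: an out-of-range raw output is clipped into actuator limits.
act = decode_action([0.3, -1.5, 0.0, 0.8, 0.0, -0.2, 1.7])
```

In a real control loop, a decoded command like this would be forwarded to the flight controller and the haptic drivers at each inference step.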
📝 Abstract
We present VLH, a novel Visual-Language-Haptic Foundation Model that unifies perception, language, and tactile feedback in aerial robotics and virtual reality. Unlike prior work that treats haptics as a secondary, reactive channel, VLH synthesizes mid-air force and vibration cues as a direct consequence of contextual visual understanding and natural language commands. Our platform comprises an 8-inch quadcopter equipped with dual inverse five-bar linkage arrays for localized haptic actuation, an egocentric VR camera, and an exocentric top-down view. Visual inputs and language instructions are processed by a fine-tuned OpenVLA backbone (adapted via LoRA on a bespoke dataset of 450 multimodal scenarios) to output a 7-dimensional action vector (Vx, Vy, Vz, Hx, Hy, Hz, Hv). INT8 quantization and a high-performance server ensure real-time operation at 4-5 Hz. In human-robot interaction experiments (90 flights), VLH achieved a 56.7% success rate for target acquisition (mean reach time 21.3 s, pose error 0.24 m) and 100% accuracy in texture discrimination. Generalization tests yielded 70.0% (visual), 54.4% (motion), 40.0% (physical), and 35.0% (semantic) performance on novel tasks. These results demonstrate VLH's ability to co-evolve haptic feedback with perceptual reasoning and intent, advancing expressive, immersive human-robot interactions.