Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding

📅 2025-01-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
General-purpose robot policies struggle to integrate non-visual modalities such as touch and audio, which limits manipulation success when vision is occluded or otherwise degraded. Method: FuSe is a framework that uses natural language as a shared cross-modal grounding to efficiently finetune existing generalist policies, including both vision-language-action (VLA) models and diffusion-based policies, on heterogeneous sensor modalities. Its training recipe combines the standard action-prediction objective with a multimodal contrastive loss and a sensory-grounded language generation loss. Contribution/Results: The finetuned policies support zero-shot multimodal prompting, compositional cross-modal reasoning, and semantic descriptions of the objects they interact with, and they improve real-world task success rates by over 20% relative to all considered baselines. By treating language as a common interface for multimodal perception and decision-making, FuSe moves beyond the vision-centric paradigm of current generalist policies.
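
To make the training recipe concrete, below is a minimal sketch of how the auxiliary objectives described above could be combined with the policy's action loss. The function names, tensor shapes, loss weights, and the CLIP-style InfoNCE formulation are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(sensor_emb, text_emb, temperature=0.07):
    # CLIP-style InfoNCE between fused sensor embeddings and language embeddings;
    # matched (sensor, language) pairs lie on the diagonal of the similarity matrix.
    sensor_emb = F.normalize(sensor_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = sensor_emb @ text_emb.t() / temperature            # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: sensor-to-text and text-to-sensor directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def fuse_finetune_loss(action_loss, sensor_emb, text_emb,
                       caption_logits, caption_tokens,
                       w_contrastive=1.0, w_generation=1.0):
    # Sensory-grounded language generation: teacher-forced cross-entropy over
    # caption tokens describing what the (possibly non-visual) sensors perceive.
    gen_loss = F.cross_entropy(caption_logits.flatten(0, 1), caption_tokens.flatten())
    con_loss = multimodal_contrastive_loss(sensor_emb, text_emb)
    # Total objective: policy action loss plus the two language-grounded auxiliaries.
    return action_loss + w_contrastive * con_loss + w_generation * gen_loss
```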

📝 Abstract
Interacting with the world is a multi-sensory experience: achieving effective general-purpose interaction requires making use of all available modalities -- including vision, touch, and audio -- to fill in gaps from partial observation. For example, when vision is occluded reaching into a bag, a robot should rely on its senses of touch and sound. However, state-of-the-art generalist robot policies are typically trained on large datasets to predict robot actions solely from visual and proprioceptive observations. In this work, we propose FuSe, a novel approach that enables finetuning visuomotor generalist policies on heterogeneous sensor modalities for which large datasets are not readily available by leveraging natural language as a common cross-modal grounding. We combine a multimodal contrastive loss with a sensory-grounded language generation loss to encode high-level semantics. In the context of robot manipulation, we show that FuSe enables performing challenging tasks that require reasoning jointly over modalities such as vision, touch, and sound in a zero-shot setting, such as multimodal prompting, compositional cross-modal prompting, and descriptions of objects it interacts with. We show that the same recipe is applicable to widely different generalist policies, including both diffusion-based generalist policies and large vision-language-action (VLA) models. Extensive experiments in the real world show that FuSe is able to increase success rates by over 20% compared to all considered baselines.
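
As an illustration of the zero-shot compositional cross-modal prompting described in the abstract, the sketch below shows what querying a FuSe-finetuned policy with non-visual observations might look like. The observation keys, array shapes, and the `policy.predict_action` call are hypothetical placeholders; the actual interface depends on the underlying generalist model.

```python
import numpy as np

# Hypothetical observation dict for a FuSe-finetuned policy (dummy data).
observation = {
    "image_primary": np.zeros((256, 256, 3), dtype=np.uint8),  # RGB view, possibly occluded
    "image_tactile": np.zeros((224, 224, 3), dtype=np.uint8),  # tactile sensor image
    "audio": np.zeros((1, 16000), dtype=np.float32),           # 1 s microphone clip
}

# Compositional cross-modal instruction: the goal is grounded in touch and sound
# rather than appearance, so the policy must reason over non-visual modalities.
instruction = "pick up the soft object that rattles when shaken"

# action = policy.predict_action(observation, task=instruction)  # hypothetical call
```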
Problem

Research questions and friction points this paper is trying to address.

Multisensory Integration
Robotic Perception
Non-visual Modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

FuSe (Sensory Fusion)
Multimodal Sensory Information
Language Understanding for Robots