Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding

📅 2025-01-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
General-purpose robot policies struggle to integrate non-visual modalities such as touch and audio, which limits manipulation success when vision is occluded or otherwise degraded. Method: FuSe is a framework that uses natural language as a shared cross-modal grounding to efficiently finetune existing generalist policies, including both vision-language-action (VLA) models and diffusion-based policies, on heterogeneous sensor modalities. Its training recipe combines the standard action-prediction objective with a multimodal contrastive loss and a sensory-grounded language generation loss. Contribution/Results: The finetuned policies support zero-shot multimodal prompting, compositional cross-modal reasoning, and semantic descriptions of the objects they interact with, and they improve real-world task success rates by over 20% relative to all considered baselines. By treating language as a common interface for multimodal perception and decision-making, FuSe moves beyond the vision-centric paradigm of current generalist policies.
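
To make the training recipe concrete, below is a minimal sketch of how the auxiliary objectives described above could be combined with the policy's action loss. The function names, tensor shapes, loss weights, and the CLIP-style InfoNCE formulation are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(sensor_emb, text_emb, temperature=0.07):
    # CLIP-style InfoNCE between fused sensor embeddings and language embeddings;
    # matched (sensor, language) pairs lie on the diagonal of the similarity matrix.
    sensor_emb = F.normalize(sensor_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = sensor_emb @ text_emb.t() / temperature            # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: sensor-to-text and text-to-sensor directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def fuse_finetune_loss(action_loss, sensor_emb, text_emb,
                       caption_logits, caption_tokens,
                       w_contrastive=1.0, w_generation=1.0):
    # Sensory-grounded language generation: teacher-forced cross-entropy over
    # caption tokens describing what the (possibly non-visual) sensors perceive.
    gen_loss = F.cross_entropy(caption_logits.flatten(0, 1), caption_tokens.flatten())
    con_loss = multimodal_contrastive_loss(sensor_emb, text_emb)
    # Total objective: policy action loss plus the two language-grounded auxiliaries.
    return action_loss + w_contrastive * con_loss + w_generation * gen_loss
```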

📝 Abstract
Interacting with the world is a multi-sensory experience: achieving effective general-purpose interaction requires making use of all available modalities -- including vision, touch, and audio -- to fill in gaps from partial observation. For example, when vision is occluded reaching into a bag, a robot should rely on its senses of touch and sound. However, state-of-the-art generalist robot policies are typically trained on large datasets to predict robot actions solely from visual and proprioceptive observations. In this work, we propose FuSe, a novel approach that enables finetuning visuomotor generalist policies on heterogeneous sensor modalities for which large datasets are not readily available by leveraging natural language as a common cross-modal grounding. We combine a multimodal contrastive loss with a sensory-grounded language generation loss to encode high-level semantics. In the context of robot manipulation, we show that FuSe enables performing challenging tasks that require reasoning jointly over modalities such as vision, touch, and sound in a zero-shot setting, such as multimodal prompting, compositional cross-modal prompting, and descriptions of objects it interacts with. We show that the same recipe is applicable to widely different generalist policies, including both diffusion-based generalist policies and large vision-language-action (VLA) models. Extensive experiments in the real world show that FuSe is able to increase success rates by over 20% compared to all considered baselines.
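
As an illustration of the zero-shot compositional cross-modal prompting described in the abstract, the sketch below shows what querying a FuSe-finetuned policy with non-visual observations might look like. The observation keys, array shapes, and the `policy.predict_action` call are hypothetical placeholders; the actual interface depends on the underlying generalist model.

```python
import numpy as np

# Hypothetical observation dict for a FuSe-finetuned policy (dummy data).
observation = {
    "image_primary": np.zeros((256, 256, 3), dtype=np.uint8),  # RGB view, possibly occluded
    "image_tactile": np.zeros((224, 224, 3), dtype=np.uint8),  # tactile sensor image
    "audio": np.zeros((1, 16000), dtype=np.float32),           # 1 s microphone clip
}

# Compositional cross-modal instruction: the goal is grounded in touch and sound
# rather than appearance, so the policy must reason over non-visual modalities.
instruction = "pick up the soft object that rattles when shaken"

# action = policy.predict_action(observation, task=instruction)  # hypothetical call
```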
Problem

Research questions and friction points this paper is trying to address.

Multisensory Integration
Robotic Perception
Non-visual Modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

FuSe (Sensory Fusion)
Multimodal Sensory Information
Language Understanding for Robots