OMCL: Open-vocabulary Monte Carlo Localization

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of robustness in map-observation feature association for robot localization across heterogeneous sensors and modalities. We propose a Monte Carlo localization framework that leverages open-vocabulary vision-language features. Methodologically, we introduce vision-language models (e.g., CLIP) into particle filtering for the first time, exploiting their zero-shot semantic understanding to construct 3D semantic maps from RGB-D or point cloud data and to reformulate the observation likelihood function, which enables natural language-guided global initialization and pose estimation. Our contributions are threefold: (1) cross-modal alignment of visual, geometric, and linguistic features; (2) open-vocabulary, zero-shot, language-driven global localization; and (3) state-of-the-art accuracy on indoor benchmarks (Matterport3D and Replica) and strong cross-domain generalization on the outdoor SemanticKITTI dataset, significantly enhancing localization robustness and adaptability.
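To make the measurement model concrete, below is a minimal sketch of how an open-vocabulary observation likelihood could plug into the MCL weight update. It is not the paper's exact formulation: the `FeatureMap` class, the radius-based `features_at` visibility lookup, and the temperature-scaled mapping from mean cosine similarity to likelihood are all illustrative assumptions.

```python
import numpy as np

class FeatureMap:
    """Minimal stand-in for a 3D semantic feature map: (xyz, feature) pairs."""
    def __init__(self, voxels):
        self.voxels = voxels  # list of ((x, y, z), np.ndarray feature)

    def features_at(self, pose, radius=3.0):
        # Toy visibility model: features of voxels within `radius` of the pose.
        x, y = pose[0], pose[1]
        return [f for (vx, vy, _), f in self.voxels
                if (vx - x) ** 2 + (vy - y) ** 2 <= radius ** 2]

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def observation_likelihood(pose, obs_features, semantic_map, temperature=0.1):
    """Score one particle: compare observed vision-language features with the
    map features expected to be visible from this pose hypothesis."""
    expected = semantic_map.features_at(pose)
    if not expected:
        return 1e-6  # pose sees no mapped geometry: near-zero likelihood
    # Match each observed feature to its best-aligned map feature.
    sims = [max(cosine_sim(f, m) for m in expected) for f in obs_features]
    # Temperature-scaled exponential turns mean similarity into a likelihood.
    return float(np.exp(np.mean(sims) / temperature))

def update_weights(particles, weights, obs_features, semantic_map):
    """Standard MCL measurement update using the feature-based likelihood."""
    w = np.array([wi * observation_likelihood(p, obs_features, semantic_map)
                  for p, wi in zip(particles, weights)])
    return w / w.sum()  # renormalize
```

Because the same embedding space covers all modalities, the likelihood is agnostic to whether the map features came from RGB-D frames or point clouds, which is what enables the cross-sensor association the summary describes.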

📝 Abstract
Robust robot localization is an important prerequisite for navigation planning. If the environment map was created with sensors different from those on the robot, robot measurements must be robustly associated with map features. In this work, we extend Monte Carlo Localization with vision-language features. These open-vocabulary features make it possible to robustly compute the likelihood of visual observations given a camera pose and a 3D map created from posed RGB-D images or aligned point clouds. The abstract vision-language features allow observations and map elements from different modalities to be associated. Global localization can be initialized from natural language descriptions of the objects in the vicinity of a location. We evaluate our approach on Matterport3D and Replica for indoor scenes and demonstrate generalization on SemanticKITTI for outdoor scenes.
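As a companion sketch, language-driven global initialization might look like the following: score map elements by similarity to the text embedding and scatter pose hypotheses around the best matches. The query vector could come from a CLIP-style text encoder (e.g., `clip.tokenize` plus `model.encode_text` in the openai/CLIP package); `init_particles_from_text`, the `top_k` cutoff, and the Gaussian scatter are assumptions for illustration, reusing the hypothetical `FeatureMap` from the sketch above.

```python
import numpy as np

def init_particles_from_text(text_feature, semantic_map, n_particles=1000, top_k=50):
    """Language-driven global initialization: sample pose hypotheses around
    the map locations whose features best match the text query embedding."""
    q = text_feature / (np.linalg.norm(text_feature) + 1e-8)
    # Rank map elements by cosine similarity to the query.
    scored = sorted(semantic_map.voxels,
                    key=lambda v: -np.dot(v[1] / np.linalg.norm(v[1]), q))
    anchors = [xyz for xyz, _ in scored[:top_k]]
    particles = []
    for _ in range(n_particles):
        ax, ay, _ = anchors[np.random.randint(len(anchors))]
        # Scatter (x, y, heading) hypotheses around the matched location.
        particles.append((ax + np.random.normal(0.0, 0.5),
                          ay + np.random.normal(0.0, 0.5),
                          np.random.uniform(-np.pi, np.pi)))
    return particles
```

From there, the standard MCL loop of motion update, feature-based measurement update, and resampling would refine these hypotheses to a pose estimate.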
Problem

Research questions and friction points this paper is trying to address.

Map-observation feature association lacks robustness when the map was built from sensors or modalities different from the robot's
Conventional likelihood models cannot relate observations to map elements across modalities
Global localization typically cannot be initialized from natural language descriptions of a place's surroundings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends Monte Carlo Localization with vision-language features
Uses open-vocabulary features for robust observation likelihood computation
Enables cross-modal association and natural language initialization via a shared 3D vision-language feature map (see the map-fusion sketch after this list)
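The cross-modal association rests on a 3D map whose elements carry vision-language features. Below is a minimal sketch of fusing one posed RGB-D frame into such a map, assuming per-pixel features from a pixel-aligned CLIP variant (e.g., an LSeg- or OpenSeg-style model); the voxel running-mean fusion and all names here are illustrative, not the paper's exact pipeline.

```python
import numpy as np

def fuse_frame_into_map(voxel_feats, depth, feat_map_2d, K, T_wc, voxel=0.1):
    """Fuse one posed RGB-D frame into a voxelized feature map (hypothetical).
    depth: (H, W) metric depth; feat_map_2d: (H, W, D) per-pixel
    vision-language features; K: 3x3 intrinsics; T_wc: 4x4 camera-to-world."""
    H, W = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0
    # Back-project valid pixels to camera coordinates, then to the world frame.
    z = depth
    xc = (us - cx) / fx * z
    yc = (vs - cy) / fy * z
    pts_c = np.stack([xc[valid], yc[valid], z[valid], np.ones(valid.sum())])
    pts_w = (T_wc @ pts_c)[:3].T
    feats = feat_map_2d[valid]
    for p, f in zip(pts_w, feats):
        key = tuple(np.floor(p / voxel).astype(int))
        # Running mean fuses features observed from multiple views/modalities.
        cnt, mean = voxel_feats.get(key, (0, np.zeros_like(f)))
        voxel_feats[key] = (cnt + 1, mean + (f - mean) / (cnt + 1))
    return voxel_feats
```

The same fusion applies to aligned point clouds with per-point features, since only 3D positions and feature vectors enter the map.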