🤖 AI Summary
This paper addresses the Multi-Object Search (MOS) problem in unknown environments, i.e., efficiently localizing multiple semantic targets while minimizing path cost. We propose an end-to-end navigation framework grounded in Vision-Language Models (VLMs), whose core innovation is a multi-channel score map mechanism: it jointly models the spatial distribution of each target and cross-target semantic correlations, while integrating scene-level and object-level semantic alignment embeddings to support dynamic target addition/removal and long-horizon planning. The method enables semantics-driven joint reasoning and policy learning, and significantly outperforms existing deep reinforcement learning and VLM-based baselines in both simulated and real-world settings. Ablation studies validate the efficacy of each component, and scalability experiments demonstrate robust performance on complex search tasks involving 10+ targets.
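To make the multi-channel score map idea concrete, here is a minimal illustrative sketch: one 2D probability grid per target object, updated with semantic-similarity scores and queried jointly for the next waypoint. The grid size, target names, update rule, and greedy policy are all hypothetical assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

GRID = (8, 8)          # hypothetical coarse top-down map of the environment
targets = ["mug", "laptop", "keys"]

# One score channel per target, initialized to a uniform prior over cells.
score_maps = {t: np.full(GRID, 1.0 / (GRID[0] * GRID[1])) for t in targets}

def update_channel(scores, cell, similarity):
    """Boost one cell by a semantic-similarity score (e.g., from a VLM),
    then renormalize so the channel stays a probability map."""
    scores = scores.copy()
    scores[cell] += similarity
    return scores / scores.sum()

# Example: an observation at cell (2, 3) looks mug-like (similarity 0.9).
score_maps["mug"] = update_channel(score_maps["mug"], (2, 3), 0.9)

# A simple joint policy: move toward the highest-scoring cell across all
# channels, so evidence for any target can redirect the search.
stacked = np.stack([score_maps[t] for t in targets])   # (n_targets, H, W)
best = np.unravel_index(np.argmax(stacked), stacked.shape)
print("next target:", targets[best[0]], "at cell", best[1:])
```

Keeping one channel per target (rather than a single merged map) is what allows targets to be added or removed mid-search: a channel is simply created or dropped without disturbing the others.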
📄 Abstract
The Multi-Object Search (MOS) problem involves navigating to a sequence of locations to maximize the likelihood of finding target objects while minimizing travel costs. In this paper, we introduce a novel approach to the MOS problem, called Finder, which leverages vision-language models (VLMs) to locate multiple objects across diverse environments. Specifically, our approach introduces multi-channel score maps to track and reason about multiple objects simultaneously during navigation, along with a score map technique that combines scene-level and object-level semantic correlations. Experiments in both simulated and real-world settings showed that Finder outperforms existing methods based on deep reinforcement learning and VLMs. Ablation and scalability studies further validated our design choices and robustness with increasing numbers of target objects, respectively. Website: https://find-all-my-things.github.io/