🤖 AI Summary
This paper addresses the Multi-Object Search (MOS) problem in unknown environments, i.e., efficiently localizing multiple semantic targets while minimizing path cost. We propose an end-to-end navigation framework grounded in Vision-Language Models (VLMs), whose core innovation is a multi-channel score map mechanism: it jointly models the spatial distribution of each target and cross-target semantic correlations, while integrating scene-level and object-level semantic alignment embeddings to support dynamic target addition/removal and long-horizon planning. The method enables semantics-driven joint reasoning and policy learning, and significantly outperforms existing deep reinforcement learning and VLM-based baselines in both simulated and real-world settings. Ablation studies validate the efficacy of each component, and scalability experiments demonstrate robust performance on complex search tasks involving 10+ targets.
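To make the multi-channel score map idea concrete, here is a minimal illustrative sketch: one 2D probability grid per target object, updated with semantic-similarity scores and queried jointly for the next waypoint. The grid size, target names, update rule, and greedy policy are all hypothetical assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

GRID = (8, 8)          # hypothetical coarse top-down map of the environment
targets = ["mug", "laptop", "keys"]

# One score channel per target, initialized to a uniform prior over cells.
score_maps = {t: np.full(GRID, 1.0 / (GRID[0] * GRID[1])) for t in targets}

def update_channel(scores, cell, similarity):
    """Boost one cell by a semantic-similarity score (e.g., from a VLM),
    then renormalize so the channel stays a probability map."""
    scores = scores.copy()
    scores[cell] += similarity
    return scores / scores.sum()

# Example: an observation at cell (2, 3) looks mug-like (similarity 0.9).
score_maps["mug"] = update_channel(score_maps["mug"], (2, 3), 0.9)

# A simple joint policy: move toward the highest-scoring cell across all
# channels, so evidence for any target can redirect the search.
stacked = np.stack([score_maps[t] for t in targets])   # (n_targets, H, W)
best = np.unravel_index(np.argmax(stacked), stacked.shape)
print("next target:", targets[best[0]], "at cell", best[1:])
```

Keeping one channel per target (rather than a single merged map) is what allows targets to be added or removed mid-search: a channel is simply created or dropped without disturbing the others.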
📄 Abstract
The Multi-Object Search (MOS) problem involves navigating to a sequence of locations to maximize the likelihood of finding target objects while minimizing travel costs. In this paper, we introduce a novel approach to the MOS problem, called Finder, which leverages vision-language models (VLMs) to locate multiple objects across diverse environments. Specifically, our approach introduces multi-channel score maps to track and reason about multiple objects simultaneously during navigation, along with a score map technique that combines scene-level and object-level semantic correlations. Experiments in both simulated and real-world settings showed that Finder outperforms existing methods based on deep reinforcement learning and VLMs. Ablation and scalability studies further validated our design choices and robustness with increasing numbers of target objects, respectively. Website: https://find-all-my-things.github.io/