🤖 AI Summary
Large Vision-Language Models (LVLMs) exhibit weak generalization and poor cross-scene/cross-object adaptability in open-vocabulary embodied navigation. Method: We introduce DivScene—the first large-scale, diverse benchmark of this kind, comprising 4,614 scenes across 81 scene types—and propose NatVLM, an end-to-end embodied agent. We systematically define and evaluate LVLMs' cross-scene and cross-object navigation generalization for the first time; incorporate chain-of-thought (CoT) reasoning to produce interpretable action trajectories; and train the LVLM navigator via imitation learning without human supervision, using shortest paths from a BFS planner as the supervisory signal. Contribution/Results: On DivScene, NatVLM achieves a success rate over 20 points higher than GPT-4o, demonstrating substantial improvements in zero-shot generalization to unseen scenes and novel objects. All code and data are publicly released.
📝 Abstract
Object navigation in unknown environments is crucial for deploying embodied agents in real-world applications. While we have witnessed huge progress due to large-scale scene datasets, faster simulators, and stronger models, previous studies mainly focus on limited scene types and target objects. In this paper, we study a new task of navigating to diverse target objects across a large number of scene types. To benchmark the problem, we present a large-scale scene dataset, DivScene, which contains 4,614 scenes across 81 different types. With the dataset, we build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning. The LVLM is trained to take previous observations from the environment and generate the next actions. We also introduce CoT explanation traces for action prediction, which improve performance when tuning LVLMs. Our extensive experiments show that we can build a performant LVLM-based agent through imitation learning on shortest paths constructed by a BFS planner, without any human supervision. Our agent achieves a success rate that surpasses GPT-4o by over 20%. We also carry out various analyses showing the generalization ability of our agent. Our code and data are available at https://github.com/zhaowei-wang-nlp/DivScene.
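The BFS planner mentioned above supplies shortest paths as imitation-learning targets; the paper's own planner operates in the simulator, but the idea can be sketched with a minimal grid-based BFS (the grid representation and function name here are illustrative assumptions, not the authors' implementation):

```python
from collections import deque

def bfs_shortest_path(grid, start, goal):
    """Illustrative BFS on a 2D occupancy grid.

    grid[r][c] == 0 means free space, 1 means blocked.
    Returns the shortest list of cells from start to goal,
    or None if the goal is unreachable. Such paths can serve
    as expert trajectories for imitation learning.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    parent = {start: None}  # also doubles as the visited set
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            # Reconstruct the path by walking parent pointers back.
            path, node = [], goal
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in parent):
                parent[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None

# Example: a small map with a wall forcing a detour.
grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
path = bfs_shortest_path(grid, (0, 0), (2, 0))
```

Because BFS explores cells in order of distance from the start, the first time it pops the goal the reconstructed path is guaranteed shortest, which is what makes it a clean, supervision-free source of expert trajectories.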