🤖 AI Summary
This work addresses the lack of high-quality, semantically coherent, and scalable benchmarks for Open-Vocabulary Object Navigation (OVON) in dynamic real-world scenarios. We introduce the first OVON dataset and generation pipeline supporting interactive objects and scene dynamics. Our method integrates multimodal foundation model-driven scene synthesis, semantic consistency modeling grounded in everyday commonsense knowledge, and Habitat-based simulation. Leveraging 2.5k real-world scanned scenes and 0.9k human-verified interactive objects, we construct SD-OVON-3k/10k, standardized task sets for dynamic OVON evaluation. Key contributions include: (1) the first benchmark explicitly designed for OVON in dynamic environments; (2) cross-domain adaptability between real-to-sim and sim-to-real settings; and (3) full open-sourcing of the data, code, and two strong baseline models. Our framework substantially improves the realism of navigation tasks and the generalization of OVON agents in complex, time-varying environments.
📝 Abstract
We present the Semantics-aware Dataset and Benchmark Generation Pipeline for Open-vocabulary Object Navigation in Dynamic Scenes (SD-OVON). It utilizes pretrained multimodal foundation models to generate an unlimited number of unique photo-realistic scene variants that adhere to real-world semantics and everyday commonsense for the training and evaluation of navigation agents, accompanied by a plugin for generating object navigation task episodes compatible with the Habitat simulator. In addition, we offer two pre-generated object navigation task datasets, SD-OVON-3k and SD-OVON-10k, comprising about 3k and 10k episodes of the open-vocabulary object navigation task, respectively. Both are derived from the SD-OVON-Scenes dataset, with 2.5k photo-realistic scans of real-world environments, and the SD-OVON-Objects dataset, with 0.9k manually inspected scanned and artist-created manipulatable object models. Unlike prior datasets limited to static environments, SD-OVON covers dynamic scenes and manipulatable objects, facilitating both real-to-sim and sim-to-real robotic applications. This enhances the realism of navigation tasks and supports the training and evaluation of open-vocabulary object navigation agents in complex settings. To demonstrate the effectiveness of our pipeline and datasets, we propose two baselines and evaluate them, along with state-of-the-art baselines, on SD-OVON-3k. The datasets, benchmark, and source code are publicly available.
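To make the episode datasets concrete, the sketch below shows what a single object navigation episode record might look like, loosely following the convention used by Habitat-style ObjectNav datasets (an episode ID, a scene reference, a start pose, and a goal category). The field names, file paths, and the `dynamic` flag here are illustrative assumptions, not the actual SD-OVON schema.

```python
import json

# Hypothetical episode record for an open-vocabulary object navigation
# task, modeled loosely on Habitat ObjectNav episode fields. All names
# and values below are illustrative, not the real SD-OVON schema.
episode = {
    "episode_id": "0",
    "scene_id": "sd-ovon-scenes/scene_0001.glb",  # hypothetical scene path
    "start_position": [1.5, 0.0, -2.3],           # agent start, meters (x, y, z)
    "start_rotation": [0.0, 0.0, 0.0, 1.0],       # quaternion (x, y, z, w)
    "object_category": "coffee mug",              # open-vocabulary goal phrase
    "dynamic": True,                              # scene contains movable objects
}

# Episodes are typically bundled into a JSON dataset file that a
# simulator-side plugin can load at training or evaluation time.
dataset = {"episodes": [episode]}
serialized = json.dumps(dataset)
loaded = json.loads(serialized)
print(loaded["episodes"][0]["object_category"])  # → coffee mug
```

Because the goal is given as free-form text rather than an index into a fixed label set, the same episode format can carry arbitrary open-vocabulary categories without changes to the dataset schema.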