🤖 AI Summary
This work addresses the lack of high-quality, semantically coherent, and scalable benchmarks for Open-Vocabulary Object Navigation (OVON) in dynamic real-world scenarios. We introduce the first OVON dataset and generation pipeline supporting interactive objects and scene dynamics. Our method integrates multimodal foundation model-driven scene synthesis, semantic consistency modeling grounded in everyday commonsense knowledge, and Habitat-based simulation. Leveraging 2.5k real-world scanned scenes and 0.9k human-verified interactive objects, we construct SD-OVON-3k/10k, standardized task sets for dynamic OVON evaluation. Key contributions include: (1) the first benchmark explicitly designed for OVON in dynamic environments; (2) cross-domain adaptability between real-to-sim and sim-to-real settings; and (3) full open-sourcing of the data, code, and two strong baseline models. Our framework substantially improves the realism of navigation tasks and the generalization of OVON agents in complex, time-varying environments.
📝 Abstract
We present the Semantics-aware Dataset and Benchmark Generation Pipeline for Open-vocabulary Object Navigation in Dynamic Scenes (SD-OVON). It utilizes pretrained multimodal foundation models to generate an unlimited number of unique photo-realistic scene variants that adhere to real-world semantics and everyday commonsense for the training and evaluation of navigation agents, accompanied by a plugin for generating object navigation task episodes compatible with the Habitat simulator. In addition, we offer two pre-generated object navigation task datasets, SD-OVON-3k and SD-OVON-10k, comprising about 3k and 10k episodes of the open-vocabulary object navigation task, respectively. Both are derived from the SD-OVON-Scenes dataset, with 2.5k photo-realistic scans of real-world environments, and the SD-OVON-Objects dataset, with 0.9k manually inspected scanned and artist-created manipulatable object models. Unlike prior datasets limited to static environments, SD-OVON covers dynamic scenes and manipulatable objects, facilitating both real-to-sim and sim-to-real robotic applications. This enhances the realism of navigation tasks and supports the training and evaluation of open-vocabulary object navigation agents in complex settings. To demonstrate the effectiveness of our pipeline and datasets, we propose two baselines and evaluate them, along with state-of-the-art baselines, on SD-OVON-3k. The datasets, benchmark, and source code are publicly available.
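To make the episode datasets concrete, the sketch below shows what a single object navigation episode record might look like, loosely following the convention used by Habitat-style ObjectNav datasets (an episode ID, a scene reference, a start pose, and a goal category). The field names, file paths, and the `dynamic` flag here are illustrative assumptions, not the actual SD-OVON schema.

```python
import json

# Hypothetical episode record for an open-vocabulary object navigation
# task, modeled loosely on Habitat ObjectNav episode fields. All names
# and values below are illustrative, not the real SD-OVON schema.
episode = {
    "episode_id": "0",
    "scene_id": "sd-ovon-scenes/scene_0001.glb",  # hypothetical scene path
    "start_position": [1.5, 0.0, -2.3],           # agent start, meters (x, y, z)
    "start_rotation": [0.0, 0.0, 0.0, 1.0],       # quaternion (x, y, z, w)
    "object_category": "coffee mug",              # open-vocabulary goal phrase
    "dynamic": True,                              # scene contains movable objects
}

# Episodes are typically bundled into a JSON dataset file that a
# simulator-side plugin can load at training or evaluation time.
dataset = {"episodes": [episode]}
serialized = json.dumps(dataset)
loaded = json.loads(serialized)
print(loaded["episodes"][0]["object_category"])  # → coffee mug
```

Because the goal is given as free-form text rather than an index into a fixed label set, the same episode format can carry arbitrary open-vocabulary categories without changes to the dataset schema.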