MiniVLA-Nav v1: A Multi-Scene Simulation Dataset for Language-Conditioned Robot Navigation

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses language-guided robot navigation in which the robot must reach a named target object and stop within 1 m of it across diverse environments. To this end, it introduces a simulation dataset built on the Isaac Sim platform that spans multiple scenes, spawn-distance tiers, and varied instruction templates. The dataset covers four photorealistic environments and provides synchronized RGB images, depth maps, instance segmentation masks, and expert action labels, recorded at 60 Hz in both continuous form and as tokens in a 7×7 discrete action space. It comprises 1,174 navigation episodes and five evaluation splits, including out-of-distribution (OOD) splits that test robustness to paraphrased instructions and generalization to OOD object categories. Spawn distance and trajectory length are strongly correlated (Pearson r = 0.94), which indicates that the distance tiers produce correspondingly varied trajectory lengths.
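To make the reported correlation concrete, the following is a minimal sketch of how a Pearson r between per-episode spawn distance and trajectory length could be computed. The function name and the sample numbers are illustrative assumptions; they are not values or code from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations for each variable.
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0.0 or dev_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)

# Made-up example values (metres), NOT taken from MiniVLA-Nav.
spawn_distance = [1.8, 2.9, 4.1, 6.5, 9.0]
trajectory_length = [2.1, 3.4, 4.0, 7.2, 10.5]
r = pearson_r(spawn_distance, trajectory_length)
```

A value of r close to 1 means longer spawn distances go with proportionally longer expert trajectories, which is the relationship the summary reports.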

📝 Abstract

We present MiniVLA-Nav v1, a simulation dataset for Language-Conditioned Object Approach (LCOA) navigation: given a short natural-language instruction, an NVIDIA Nova Carter differential-drive robot must navigate to the named object and stop within 1 m across four photorealistic Isaac Sim environments (Office, Hospital, Full Warehouse, and Warehouse with Multiple Shelves). Each of the 1,174 episodes pairs an instruction with synchronized 640×640 RGB images, metric depth maps (float32, metres), and instance segmentation masks, together with continuous (v, ω) and 7×7 tokenized expert action labels recorded at 60 Hz from a vision-based proportional controller. Trajectory diversity is ensured through three spawn-distance tiers (near: 1.5-3.5 m, mid: 3.5-7.0 m, far: global curated points; Pearson r = 0.94 between spawn distance and trajectory length), 12 object categories, 18 training templates, and 12 paraphrase-OOD templates. Five evaluation splits support in-distribution accuracy, template-paraphrase robustness, and OOD object-category benchmarking. The dataset is publicly available at https://huggingface.co/datasets/alibustami/miniVLA-Nav
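The abstract mentions that continuous (v, omega) commands are also stored as 7x7 tokenized labels. One plausible way to do this is uniform binning of each velocity component into 7 bins and combining the two bin indices into a single token in the range 0 to 48. The sketch below illustrates that idea only; the velocity limits, the symmetric ranges, the uniform bin edges, and the token ordering are all assumptions, and the dataset's actual scheme may differ.

```python
N_BINS = 7        # 7 bins per axis, giving 7 * 7 = 49 tokens
V_MAX = 1.0       # assumed linear-velocity limit in m/s (hypothetical)
OMEGA_MAX = 1.0   # assumed angular-velocity limit in rad/s (hypothetical)

def to_bin(value, limit, n_bins=N_BINS):
    """Map a value in [-limit, limit] to a uniform bin index in [0, n_bins - 1]."""
    clipped = max(-limit, min(limit, value))
    fraction = (clipped + limit) / (2.0 * limit)  # position in [0, 1]
    # The upper edge (fraction == 1) is folded into the last bin.
    return min(n_bins - 1, int(fraction * n_bins))

def tokenize(v, omega):
    """Combine the two bin indices into one token (assumed row-major order)."""
    return to_bin(v, V_MAX) * N_BINS + to_bin(omega, OMEGA_MAX)

def bin_center(index, limit, n_bins=N_BINS):
    """Return the value at the centre of a bin."""
    width = 2.0 * limit / n_bins
    return -limit + (index + 0.5) * width

def detokenize(token):
    """Recover the bin-centre (v, omega) pair for a token."""
    v_bin, omega_bin = divmod(token, N_BINS)
    return bin_center(v_bin, V_MAX), bin_center(omega_bin, OMEGA_MAX)

stop_token = tokenize(0.0, 0.0)  # both components fall in the middle bin
```

Under these assumptions the zero command lands in the centre bin of both axes, and decoding a token returns the bin centre, so the round trip is lossy by up to half a bin width per component.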

Problem

Research questions and friction points this paper is trying to address.

Language-Conditioned Navigation

Robot Navigation

Simulation Dataset

Object Approach

Multi-Scene

Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-Conditioned Navigation

Simulation Dataset

Multi-Modal Perception