Benchmarking Interaction, Beyond Policy: a Reproducible Benchmark for Collaborative Instance Object Navigation

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Collaborative Instance Navigation (CoIN) benchmarks evaluate only navigation success rates, overlooking the need for an independent, reproducible assessment of collaborative question-answering capabilities. To address this gap, this work proposes QAsk-Nav, the first CoIN benchmark enabling decoupled evaluation of navigation and questioning abilities. It features a lightweight questioning protocol, high-quality target descriptions, and an open-source dataset of 28,000 quality-verified reasoning and questioning trajectories. Building on this benchmark, the authors introduce Light-CoNav, an end-to-end model that integrates visual observations with natural-language dialogue. Light-CoNav significantly outperforms existing methods on unseen objects and environments while being 3x smaller and 70x faster at inference, demonstrating the effectiveness of the proposed QAsk-Nav framework.
📝 Abstract
We propose Question-Asking Navigation (QAsk-Nav), the first reproducible benchmark for Collaborative Instance Object Navigation (CoIN) that enables an explicit, separate assessment of embodied navigation and collaborative question asking. CoIN tasks an embodied agent with reaching a target specified in free-form natural language under partial observability, using only egocentric visual observations and interactive natural-language dialogue with a human, where the dialogue can help resolve ambiguity among visually similar object instances. Existing CoIN benchmarks focus primarily on navigation success and offer no support for consistent evaluation of collaborative interaction. To address this limitation, QAsk-Nav provides (i) a lightweight question-asking protocol scored independently of navigation, (ii) an enhanced navigation protocol with realistic, diverse, high-quality target descriptions, and (iii) an open-source dataset of 28,000 quality-checked reasoning and question-asking traces for training and analyzing the interactive capabilities of CoIN models. Using the proposed QAsk-Nav benchmark, we develop Light-CoNav, a lightweight unified model for collaborative navigation that is 3x smaller and 70x faster than existing modular methods, while outperforming state-of-the-art CoIN approaches in generalization to unseen objects and environments. Project page at https://benchmarking-interaction.github.io/
Problem

Research questions and friction points this paper addresses.

Collaborative Instance Object Navigation
Question-Asking
Embodied AI
Interactive Dialogue
Benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collaborative Instance Object Navigation
Question-Asking Protocol
Reproducible Benchmark
Embodied AI
Interactive Dialogue