AI Summary
This work addresses the limited multilingual instruction understanding in vision-language navigation (VLN), introducing Arabic to the VLN domain for the first time. Building upon the NavGPT pure-LLM framework, we conduct zero-shot, multilingual high-level planning evaluation on the R2R dataset, benchmarking GPT-4o mini, Llama 3 8B, Phi-3 medium 14B, and the Arabic-specialized Jais model on English-Arabic bilingual navigation reasoning. We propose the first systematic Arabic VLN evaluation benchmark, revealing pervasive parsing failures and reasoning degradation of multilingual LLMs in non-English navigation tasks. Although Jais achieves the best performance, substantial room for improvement remains. This study fills a critical gap in Arabic VLN research and establishes a new language-aware benchmark and empirical foundation for linguistically grounded robotic planning.
Abstract
Large Language Models (LLMs) such as GPT-4, trained on vast amounts of data spanning multiple domains, exhibit strong reasoning, understanding, and planning capabilities across a variety of tasks. This study presents the first work on integrating the Arabic language into the Vision-and-Language Navigation (VLN) domain in robotics, an area that remains notably underexplored. We perform a comprehensive evaluation of state-of-the-art multilingual Small Language Models (SLMs), including GPT-4o mini, Llama 3 8B, and Phi-3 medium 14B, alongside the Arabic-centric LLM Jais. Our approach uses the NavGPT framework, a purely LLM-based instruction-following navigation agent, to assess the impact of language on navigation reasoning through zero-shot sequential action prediction on the R2R dataset. Our experiments demonstrate that the framework is capable of high-level planning for navigation tasks when given instructions in either English or Arabic. However, certain models struggled with reasoning and planning in Arabic because of inherent limitations in their capabilities, sub-optimal performance, and parsing issues. These findings highlight the importance of improving the planning and reasoning capabilities of language models for effective navigation, and they point to the potential of Arabic-language models for impactful real-world applications.
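To make "zero-shot sequential action prediction" and "parsing issues" concrete, the following is a minimal illustrative sketch, not the NavGPT implementation. The reply format (`Action: <viewpoint>` or `Action: STOP`), the toy viewpoint graph, and the names `parse_action` and `run_episode` are all assumptions introduced here for illustration. The `llm` argument is any callable that maps a prompt string to a reply string.

```python
import re
from typing import Callable, Optional

# Hypothetical reply format (an assumption, not NavGPT's actual protocol):
# the model must end its reply with "Action: <viewpoint id>" or "Action: STOP".
ACTION_RE = re.compile(r"Action:\s*(\S+)\s*$")


def parse_action(reply: str, candidates: list[str]) -> Optional[str]:
    """Return a valid action, or None when the reply cannot be parsed."""
    match = ACTION_RE.search(reply.strip())
    if match is None:
        return None
    action = match.group(1)
    if action == "STOP" or action in candidates:
        return action
    return None  # names a viewpoint that is not navigable from here


def run_episode(
    llm: Callable[[str], str],
    instruction: str,
    graph: dict[str, list[str]],
    start: str,
    max_steps: int = 10,
) -> tuple[list[str], int]:
    """Predict one action per step with no task-specific training (zero-shot)."""
    position, path, parse_failures = start, [start], 0
    for _ in range(max_steps):
        candidates = graph.get(position, [])
        prompt = (
            f"Instruction: {instruction}\n"
            f"Current viewpoint: {position}\n"
            f"Navigable viewpoints: {', '.join(candidates)}\n"
            "Reply with 'Action: <viewpoint>' or 'Action: STOP'."
        )
        action = parse_action(llm(prompt), candidates)
        if action is None:
            parse_failures += 1  # count the failure and end the episode
            break
        if action == "STOP":
            break
        position = action
        path.append(position)
    return path, parse_failures
```

In this sketch, a reply that drifts from the expected format (for example, a free-form Arabic sentence with no `Action:` line) is recorded as a parse failure rather than silently mapped to a guess, which keeps parsing errors separate from genuine reasoning errors when scoring a model.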