FLAME: Learning to Navigate with Multimodal LLM in Urban Environments

📅 2024-08-20
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
Large language models (LLMs) underperform specialized models in urban vision-and-language navigation (VLN), where an agent must reason jointly over street-view imagery, its trajectory, and natural language instructions. To address this, the paper proposes FLAME (FLAMingo-Architected Embodied Agent), a multimodal LLM-based agent adapted to navigation through a three-phase tuning technique: (1) single perception tuning for street view description, (2) multiple perception tuning for trajectory summarization, and (3) end-to-end training on VLN datasets. The augmented datasets used for tuning are synthesized automatically. On the Touchdown dataset, FLAME surpasses prior state-of-the-art methods by 7.3% in task completion rate. The work showcases the potential of multimodal LLMs for complex urban navigation tasks.

📝 Abstract
Large Language Models (LLMs) have demonstrated potential in Vision-and-Language Navigation (VLN) tasks, yet current applications face challenges. While LLMs excel in general conversation scenarios, they struggle with specialized navigation tasks and perform worse than specialized VLN models. We introduce FLAME (FLAMingo-Architected Embodied Agent), a novel Multimodal LLM-based agent and architecture designed for urban VLN tasks that efficiently handles multiple observations. Our approach adapts the model to navigation through a three-phase tuning technique: single perception tuning for street view description, multiple perception tuning for trajectory summarization, and end-to-end training on VLN datasets. The augmented datasets are synthesized automatically. Experimental results demonstrate FLAME's superiority over existing methods: it surpasses the state of the art by 7.3% in task completion rate on the Touchdown dataset. This work showcases the potential of Multimodal LLMs (MLLMs) in complex navigation tasks and represents an advance towards practical applications of MLLMs in embodied AI. Project page: https://flame-sjtu.github.io
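
The headline result is reported as task completion rate. As a rough illustration only, the sketch below computes such a rate over a set of navigation episodes on a street graph. The success criterion used here (the agent stops at the goal node or at a node adjacent to it) is an assumption about the benchmark's convention and is not stated in the abstract. The function name `task_completion_rate` and the `neighbors` adjacency mapping are hypothetical and are not taken from the FLAME code.

```python
from typing import Dict, List, Set


def task_completion_rate(
    final_nodes: List[str],
    goal_nodes: List[str],
    neighbors: Dict[str, Set[str]],
) -> float:
    """Return the percentage of episodes counted as completed.

    ASSUMPTION: an episode is completed when the agent's final node is the
    goal node or one of the goal's adjacent nodes in the street graph.
    """
    if len(final_nodes) != len(goal_nodes):
        raise ValueError("final_nodes and goal_nodes must have equal length")
    if not final_nodes:
        return 0.0

    successes = 0
    for final, goal in zip(final_nodes, goal_nodes):
        # Stopping exactly at the goal, or one hop away, counts as success.
        if final == goal or final in neighbors.get(goal, set()):
            successes += 1
    return 100.0 * successes / len(final_nodes)


# Toy street graph: adjacency is listed only for the goal nodes.
toy_neighbors = {"g1": {"a"}, "g2": {"b"}}
toy_rate = task_completion_rate(
    final_nodes=["g1", "a", "x", "b"],
    goal_nodes=["g1", "g1", "g1", "g2"],
    neighbors=toy_neighbors,
)
print(toy_rate)
```

Under this definition, a gain of 7.3 would correspond to the difference between two such percentages measured on the same evaluation split, if the reported figure is read as an absolute difference.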
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Multimodal Navigation
Specialized Navigation Tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

FLAME Model
Multimodal Large Language Model
Enhanced Navigation Performance