Multi-agent Planning using Visual Language Models

📅 2024-08-10

🏛️ European Conference on Artificial Intelligence

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Existing multi-agent embodied planning approaches rely on explicit environment models and struggle to process visual and linguistic inputs jointly. To address this, we propose an end-to-end multi-agent planning framework that needs no structured input: it takes a single image of the environment and combines vision-language models (VLMs) with commonsense knowledge to reason jointly over perception and planning. Our key contributions are: (1) the first multi-agent collaborative architecture that requires no environment graphs or symbolic representations; and (2) PG2S, a fully automated plan-quality metric that captures plan reasonableness and executability more accurately than the conventional KAS metric. Evaluated on the ALFRED benchmark, our method achieves significant improvements in task success rate and planning quality, establishing a new paradigm for image-driven embodied intelligence.

📝 Abstract

Large Language Models (LLMs) and Visual Language Models (VLMs) are attracting increasing interest due to their improving performance and applications across various domains and tasks. However, LLMs and VLMs can produce erroneous results, especially when a deep understanding of the problem domain is required. For instance, when planning and perception are needed simultaneously, these models often struggle because of difficulties in merging multi-modal information. To address this issue, fine-tuned models are typically employed and trained on specialized data structures representing the environment. This approach has limited effectiveness, as it can overly complicate the context for processing. In this paper, we propose a multi-agent architecture for embodied task planning that operates without the need for specific data structures as input. Instead, it uses a single image of the environment, handling free-form domains by leveraging commonsense knowledge. We also introduce a novel, fully automatic evaluation procedure, PG2S, designed to better assess the quality of a plan. We validated our approach using the widely recognized ALFRED dataset, comparing PG2S to the existing KAS metric to further evaluate the quality of the generated plans.

Problem

Research questions and friction points this paper is trying to address.

Multi-Robot Systems

Complex Problem Solving

Visual and Textual Information Processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Robot Collaborative Planning

PG2S Evaluation Technique

Vision-Based Problem Solving

🔎 Similar Papers

Long-Horizon Planning for Multi-Agent Robots in Partially Observable Environments