DeepFleet: Multi-Agent Foundation Models for Mobile Robots

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses collaborative planning for large-scale mobile robot fleets by proposing a suite of multi-agent foundation models. Methodologically, it designs four architectures with distinct inductive biases, spanning an autoregressive decision transformer over per-robot neighborhoods, a transformer with robot–floor cross-attention, convolutional encoding of a multi-channel image of the full fleet, and temporal attention combined with graph neural networks for spatial relationships. The primary contribution lies in systematically exploring this design space and empirically showing that the two architectures that use asynchronous robot state updates and localized interaction structure, robot-centric and graph-floor, perform best on motion prediction in warehouse operations. Experiments demonstrate consistent performance gains with increasing data scale, indicating strong scalability. This work establishes a paradigm for foundation models in robot fleet coordination.

📝 Abstract
We introduce DeepFleet, a suite of foundation models designed to support coordination and planning for large-scale mobile robot fleets. These models are trained on fleet movement data, including robot positions, goals, and interactions, from hundreds of thousands of robots in Amazon warehouses worldwide. DeepFleet consists of four architectures that each embody a distinct inductive bias and collectively explore key points in the design space for multi-agent foundation models: the robot-centric (RC) model is an autoregressive decision transformer operating on neighborhoods of individual robots; the robot-floor (RF) model uses a transformer with cross-attention between robots and the warehouse floor; the image-floor (IF) model applies convolutional encoding to a multi-channel image representation of the full fleet; and the graph-floor (GF) model combines temporal attention with graph neural networks for spatial relationships. In this paper, we describe these models and present our evaluation of the impact of these design choices on prediction task performance. We find that the robot-centric and graph-floor models, which both use asynchronous robot state updates and incorporate the localized structure of robot interactions, show the most promise. We also present experiments that show that these two models can make effective use of larger warehouse operation datasets as the models are scaled up.
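To make the image-floor (IF) input concrete, the following is a minimal sketch of a multi-channel image representation of a fleet. The channel layout (positions, goals, obstacles), grid size, and encoding function are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

H, W = 8, 8  # toy floor grid; real warehouse floors are far larger

def encode_fleet(robots, goals, obstacles, h=H, w=W):
    """Rasterize fleet state into a 3-channel occupancy image (assumed layout)."""
    img = np.zeros((3, h, w), dtype=np.float32)
    for r, c in robots:
        img[0, r, c] = 1.0  # channel 0: current robot positions
    for r, c in goals:
        img[1, r, c] = 1.0  # channel 1: robot goal cells
    for r, c in obstacles:
        img[2, r, c] = 1.0  # channel 2: static obstacles (shelving, walls)
    return img

fleet_img = encode_fleet(robots=[(1, 2), (5, 5)],
                         goals=[(7, 0), (0, 7)],
                         obstacles=[(3, 3), (3, 4)])
print(fleet_img.shape)       # (3, 8, 8)
print(int(fleet_img.sum()))  # 6 occupied cells across all channels
```

A convolutional encoder would then consume this C×H×W tensor directly, which is what lets the IF model see the whole fleet synchronously in one pass.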
Problem

Research questions and friction points this paper is trying to address.

Develop foundation models for large-scale robot fleet coordination
Evaluate four architectures for multi-agent prediction performance
Assess how asynchronous robot state updates and localized robot interactions affect model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive decision transformer for robot-centric coordination
Transformer with cross-attention for robot-floor interaction
Graph neural networks for spatial-temporal fleet relationships
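The list above can be illustrated with a toy sketch of the spatial component: one message-passing step over a robot interaction graph, in the spirit of the graph-floor model. The graph construction (edges between robots within a distance threshold) and the mean-aggregation update are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def build_adjacency(positions, radius=2.0):
    """Connect robots whose Euclidean distance is within `radius` (assumed rule)."""
    pos = np.asarray(positions, dtype=np.float32)
    n = len(pos)
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    adj = (dist <= radius) & ~np.eye(n, dtype=bool)  # no self-loops
    return adj.astype(np.float32)

def message_pass(features, adj):
    """One round of mean-neighbor aggregation with a residual self term."""
    deg = adj.sum(axis=1, keepdims=True)
    neighbor_mean = adj @ features / np.maximum(deg, 1.0)
    return features + neighbor_mean

positions = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]  # robot 2 is isolated
feats = np.eye(3, dtype=np.float32)               # toy per-robot features
adj = build_adjacency(positions)
out = message_pass(feats, adj)
```

Here robots 0 and 1 exchange features while robot 2, outside the radius, keeps its own; in the full model a temporal attention module would sit on top of such per-step graph updates.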