🤖 AI Summary
Distributed multi-agent systems face significant challenges in collaboratively interpreting vision-language instructions and solving complex tasks in open, dynamic environments. Method: This paper introduces VL-DCOPs, a framework for Vision-Language Distributed Constraint Optimization Problems and the first vision-language-driven DCOP modeling paradigm. It formally defines VL-DCOP tasks and designs a modular, plug-and-play spectrum of multimodal agents spanning both neuro-symbolic and fully neural architectures. The framework combines large multimodal models (VLMs/LLMs), joint vision-language understanding and generation, neuro-symbolic reasoning, and distributed constraint optimization techniques. Contribution/Results: Extensive experiments on three newly proposed VL-DCOP task classes show that the framework substantially reduces manual effort in constraint modeling, improves task generalization and environmental adaptability, and extends the applicability of DCOPs to real-world, open-domain scenarios.
📝 Abstract
Distributed Constraint Optimization Problems (DCOPs) offer a powerful framework for multi-agent coordination but often rely on labor-intensive, manual problem construction. To address this, we introduce VL-DCOPs, a framework that takes advantage of large multimodal foundation models (LFMs) to automatically generate constraints from both visual and linguistic instructions. We then introduce a spectrum of agent archetypes for solving VL-DCOPs: from a neuro-symbolic agent that delegates some of the algorithmic decisions to an LFM, to a fully neural agent that depends entirely on an LFM for coordination. We evaluate these agent archetypes using state-of-the-art LLMs (large language models) and VLMs (vision-language models) on three novel VL-DCOP tasks and compare their respective advantages and drawbacks. Lastly, we discuss how this work extends to broader frontier challenges in the DCOP literature.
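For readers unfamiliar with DCOPs, a minimal sketch of the underlying problem may help: each agent controls a variable with a finite domain, and constraints assign costs to combinations of values; the goal is a joint assignment minimizing total cost. The variable names, domains, and cost tables below are invented for illustration, and the solver is a centralized brute-force search, not one of the distributed algorithms (or the VL-DCOP pipeline) the paper studies.

```python
from itertools import product

# Toy DCOP: three agents, each controlling one binary variable.
variables = {"a1": [0, 1], "a2": [0, 1], "a3": [0, 1]}

# Binary constraints as cost functions over pairs of variables (lower is better).
# In VL-DCOPs, such constraints would be generated by an LFM from
# visual/linguistic instructions rather than hand-coded like this.
constraints = {
    ("a1", "a2"): lambda x, y: 0 if x != y else 2,  # prefer a1 != a2
    ("a2", "a3"): lambda x, y: 0 if x == y else 1,  # prefer a2 == a3
}

def total_cost(assignment):
    """Sum constraint costs for a complete assignment."""
    return sum(f(assignment[u], assignment[v])
               for (u, v), f in constraints.items())

# Exhaustive search over joint assignments (fine at toy sizes; real DCOP
# algorithms such as DPOP or Max-Sum solve this via message passing).
names = list(variables)
best = min(
    (dict(zip(names, values)) for values in product(*variables.values())),
    key=total_cost,
)
print(best, total_cost(best))  # an optimal assignment has cost 0 here
```

The point of the framework is that the `constraints` dictionary above, normally the labor-intensive, hand-built part of a DCOP, is instead derived automatically from multimodal instructions.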