🤖 AI Summary
This study addresses key challenges in collaborative embodied AI—namely, low multi-agent cooperation efficiency, insufficient LLM-driven communication and reasoning, and poor naturalness of human–machine interaction—by proposing an enhanced CoELA framework. Methodologically, it introduces a hierarchical prompt engineering strategy tailored for task coordination, integrating role-aware agent instructions, shared memory mechanisms, and dynamic context management; notably, it pioneers end-to-end speech interaction (ASR/TTS) to enable real-time, voice-driven collaborative decision-making. Experiments atop open-source LLMs (e.g., Gemma-3) demonstrate a 22% improvement in task completion efficiency over the original CoELA, alongside显著 gains in user immersion and system iteration speed. Core contributions include: (1) a scalable, LLM-based multi-agent prompting paradigm; (2) a speech-augmented embodied collaboration architecture; and (3) lightweight deployment validation in realistic scenarios.
📝 Abstract
The integration of Large Language Models (LLMs) into multiagent systems has opened new possibilities for collaborative reasoning and cooperation with AI agents. This paper explores different prompting methods and evaluates their effectiveness in enhancing agent collaborative behaviour and decision-making. We enhance CoELA, a framework designed for building Collaborative Embodied Agents that leverage LLMs for multi-agent communication, reasoning, and task coordination in shared virtual spaces. Through systematic experimentation, we examine different LLMs and prompt engineering strategies to identify optimised combinations that maximise collaboration performance. Furthermore, we extend our research by integrating speech capabilities, enabling seamless collaborative voice-based interactions. Our findings highlight the effectiveness of prompt optimisation in enhancing collaborative agent performance; for example, our best combination improved the efficiency of the system running with Gemma3 by 22% compared to the original CoELA system. In addition, the speech integration provides a more engaging user interface for iterative system development and demonstrations.