MALLVI: a multi-agent framework for integrated generalized robotic manipulation

πŸ“… 2026-02-18
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the limitations of existing large language model (LLM)-driven robotic task planners, which are predominantly open-loop and struggle in dynamic environments. To overcome this, we propose MALLVIβ€”a closed-loop, multi-agent collaborative framework that tightly integrates vision and language models through specialized agents: a Decomposer, a Localizer, a Thinker, and a Reflector. This architecture enables seamless coupling of perception, localization, reasoning, planning, and feedback, supporting localized error detection and replanning without requiring global recomputation. As a result, MALLVI significantly enhances robustness and generalization. Extensive evaluations in both simulated and real-world settings demonstrate substantial improvements in zero-shot task success rates, validating the effectiveness of closed-loop multi-agent collaboration for general-purpose robotic manipulation.

πŸ“ Abstract
Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine-tuning, or prompt tuning, and often operate in an open-loop manner without robust environmental feedback, making them fragile in dynamic settings. We present MALLVi, a Multi-Agent Large Language and Vision framework that enables closed-loop, feedback-driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVi generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVi coordinates specialized agents (Decomposer, Localizer, Thinker, and Reflector) to manage perception, localization, reasoning, and high-level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only the relevant agents, avoiding full replanning. Experiments in simulated and real-world settings show that iterative closed-loop multi-agent coordination improves generalization and increases success rates on zero-shot manipulation tasks. Code available at https://github.com/iman1234ahmadi/MALLVI.
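The abstract's closed-loop cycle (decompose, localize, plan, execute, verify, retry locally) can be sketched as a simple orchestration loop. This is a minimal illustrative sketch, not the authors' implementation: the agent classes, their method names, and the dictionary-based environment feedback are all hypothetical stand-ins for the LLM/VLM calls MALLVi actually makes.

```python
# Hypothetical sketch of closed-loop multi-agent coordination.
# Each agent is a stub standing in for an LLM/VLM-backed component.

class Decomposer:
    def decompose(self, instruction):
        # Split the instruction into atomic sub-tasks (stubbed).
        return [step.strip() for step in instruction.split(",")]

class Localizer:
    def localize(self, action, image):
        # Ground the sub-task's target object in the image (stubbed).
        return {"action": action, "image": image}

class Thinker:
    def plan(self, grounded):
        # Turn a grounded sub-task into an executable atomic command (stubbed).
        return f"EXEC {grounded['action']}"

class Reflector:
    def succeeded(self, command, feedback):
        # Judge post-action feedback against the expected outcome (stubbed).
        return feedback.get(command, False)

def run_task(instruction, image, feedback, max_retries=2):
    """Closed loop: for each sub-task, plan, execute, verify via feedback,
    and retry only the failed step instead of replanning globally."""
    decomposer, localizer, thinker, reflector = (
        Decomposer(), Localizer(), Thinker(), Reflector())
    log = []
    for action in decomposer.decompose(instruction):
        for _ in range(max_retries + 1):
            command = thinker.plan(localizer.localize(action, image))
            log.append(command)  # stand-in for sending the command to the robot
            if reflector.succeeded(command, feedback):
                break  # step verified; proceed without global replanning
        else:
            raise RuntimeError(f"step failed after retries: {action}")
    return log
```

The key property the loop illustrates is localized recovery: a failed step reactivates only the Localizer/Thinker for that step, while already-verified steps are never revisited.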
Problem

Research questions and friction points this paper is trying to address.

robotic manipulation
task planning
large language models
closed-loop feedback
dynamic environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent framework
closed-loop feedback
vision-language model
zero-shot manipulation
error recovery
πŸ”Ž Similar Papers
No similar papers found.
Iman Ahmadi
Department of Electrical Engineering, Sharif University of Technology

Mehrshad Taji
Department of Electrical Engineering, Sharif University of Technology

Arad Mahdinezhad Kashani
Sharif University of Technology

AmirHossein Jadidi
Department of Electrical Engineering, Sharif University of Technology

Saina Kashani
Department of Electrical Engineering, Sharif University of Technology

Babak Khalaj
Professor of Electrical Engineering
Wireless Networking · Bio Data Analytics