Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning

📅 2026-03-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing multi-agent large language model systems are brittle on out-of-distribution tasks and lack the capacity for human collaboration and continual learning. This work proposes HILA, a framework that integrates metacognitive decision-making with continual learning to build an open-ended human–AI collaborative multi-agent system. HILA employs a dual-loop optimization strategy: an inner loop uses cost-aware Group Relative Policy Optimization (GRPO) to decide autonomously whether to request human assistance, while an outer loop turns human feedback into high-quality supervision signals for long-term capability growth. Experiments show that HILA significantly outperforms state-of-the-art methods on complex mathematical and problem-solving benchmarks, improving both collaborative efficiency and the system's capacity for sustained evolution.

๐Ÿ“ Abstract
While scaling individual Large Language Models (LLMs) has delivered remarkable progress, the next frontier lies in scaling collaboration through multi-agent systems (MAS). However, purely autonomous MAS remain "closed-world" systems, constrained by the static knowledge horizon of pre-trained models. This limitation makes them brittle on tasks requiring knowledge beyond training data, often leading to collective failure under novel challenges. To address this, we propose the Human-In-the-Loop Multi-Agent Collaboration (HILA) framework, a principled paradigm for human–agent collaboration. HILA trains agents to learn a metacognitive policy that governs when to solve problems autonomously and when to defer to a human expert. To operationalize this policy, we introduce Dual-Loop Policy Optimization, which disentangles immediate decision-making from long-term capability growth. The inner loop applies Group Relative Policy Optimization (GRPO) with a cost-aware reward to optimize deferral decisions, while the outer loop implements continual learning, transforming expert feedback into high-quality supervised signals that strengthen the agent's reasoning ability. Experiments on challenging mathematical and problem-solving benchmarks show that HILA, equipped with Dual-Loop Policy Optimization, consistently outperforms advanced MAS, establishing a principled foundation for collaborative and continually improving agentic systems.
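The abstract's inner loop can be pictured concretely. The sketch below is illustrative only (the paper's exact reward and hyperparameters are not given here): it assumes a fixed deferral penalty `human_cost` and shows the group-relative normalization that GRPO applies across sampled rollouts; all names are hypothetical.

```python
# Illustrative sketch of a cost-aware deferral reward plus a GRPO-style
# group-relative advantage. The reward shape and `human_cost` weight are
# assumptions, not the paper's actual values.

def cost_aware_reward(correct: bool, deferred: bool, human_cost: float = 0.3) -> float:
    """Reward task success, but charge a fixed cost when the agent defers to a human."""
    base = 1.0 if correct else 0.0
    return base - (human_cost if deferred else 0.0)

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO normalizes each rollout's reward against its sampled group:
    advantage_i = (r_i - mean(group)) / std(group)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # avoid division by zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

# Four rollouts for one problem: solved alone, solved after deferral,
# failed alone, failed after deferral.
rewards = [
    cost_aware_reward(correct=True,  deferred=False),   # 1.0
    cost_aware_reward(correct=True,  deferred=True),    # 0.7
    cost_aware_reward(correct=False, deferred=False),   # 0.0
    cost_aware_reward(correct=False, deferred=True),    # -0.3
]
advs = grpo_advantages(rewards)
print(advs)
```

Under this shaping, an autonomous success earns the largest advantage and a failed deferral the smallest, so the policy is pushed to ask for help only when the expected gain outweighs the human-effort cost.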
Problem

Research questions and friction points this paper is trying to address.

multi-agent systems
human-in-the-loop
continual learning
metacognitive policy
LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

metacognitive policy
human-in-the-loop
multi-agent LLMs
continual learning
dual-loop policy optimization