🤖 AI Summary
Current large language models (LLMs) lack an explicit mechanism for meta-thinking (monitoring, evaluating, and controlling their own reasoning), which limits their adaptability and effectiveness on complex reasoning tasks.
Method: We propose Reinforced Meta-thinking Agents (ReMA), a framework that casts meta-thinking as a multi-agent reinforcement learning (MARL) problem. It introduces a hierarchical agent architecture: a high-level meta-thinking agent performs strategic planning and oversight, while a low-level reasoning agent carries out detailed execution. The two agents are optimized jointly through iterative reinforcement learning with aligned objectives.
Contribution/Results: ReMA addresses a key bottleneck of single-agent RL frameworks, namely the lack of a specialized design for decomposing and acquiring meta-thinking capabilities. On competitive-level mathematical benchmarks (MATH, AMC) and LLM-as-a-Judge benchmarks, it significantly outperforms single-agent RL baselines, demonstrating superior generalization and robustness. Ablation studies illustrate the evolving dynamics of each agent during training.
📝 Abstract
Recent research on the reasoning of large language models (LLMs) has sought to further enhance their performance by integrating meta-thinking -- enabling models to monitor, evaluate, and control their reasoning processes for more adaptive and effective problem-solving. However, current single-agent work lacks a specialized design for acquiring meta-thinking, resulting in low efficacy. To address this challenge, we introduce Reinforced Meta-thinking Agents (ReMA), a novel framework that leverages Multi-Agent Reinforcement Learning (MARL) to elicit meta-thinking behaviors, encouraging LLMs to think about thinking. ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent responsible for detailed execution. Through iterative reinforcement learning with aligned objectives, these agents explore and learn to collaborate, leading to improved generalization and robustness. Experimental results demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks, including competitive-level mathematical benchmarks and LLM-as-a-Judge benchmarks. Comprehensive ablation studies further illustrate the evolving dynamics of each distinct agent, providing valuable insights into how the meta-thinking process enhances the reasoning capabilities of LLMs.
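The hierarchical two-agent loop described above can be illustrated with a minimal sketch. This is not the paper's implementation: `meta_policy`, `reasoning_policy`, and `reward_fn` are hypothetical stand-ins for the meta-thinking agent, the reasoning agent, and the aligned reward, assumed here only to show the shape of one rollout in which both agents receive the same reward signal for joint optimization.

```python
def meta_policy(question: str) -> str:
    # High-level meta-thinking agent: produce a strategic plan (stubbed;
    # in ReMA this would be an LLM generating oversight and plans).
    return f"Plan: decompose '{question}' into sub-steps and verify each"

def reasoning_policy(question: str, plan: str) -> str:
    # Low-level reasoning agent: execute detailed reasoning under the plan
    # (stubbed; in ReMA this would be an LLM conditioned on the plan).
    return f"Answer to '{question}' following [{plan}]"

def reward_fn(question: str, answer: str) -> float:
    # Aligned reward shared by both agents, e.g. answer correctness
    # (stubbed to a constant here).
    return 1.0

def rollout(question: str) -> tuple[str, str, float]:
    """One hierarchical episode: the meta-agent plans, the reasoning agent
    executes, and both receive the same aligned reward. During training,
    this reward would drive iterative RL updates to both policies."""
    plan = meta_policy(question)
    answer = reasoning_policy(question, plan)
    reward = reward_fn(question, answer)
    return plan, answer, reward

plan, answer, reward = rollout("What is 2 + 2?")
```

The key design point the sketch highlights is the decoupling: the meta-agent never produces the final answer itself, and the reasoning agent never plans, so each can specialize while the shared reward keeps their objectives aligned.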