rSIM: Incentivizing Reasoning Capabilities of LLMs via Reinforced Strategy Injection

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit limited reasoning capabilities when they rely on static, hand-crafted chain-of-thought (CoT) prompting. Method: rSIM is a strategy injection framework based on multi-agent reinforcement learning with a leader–follower architecture. A lightweight, transferable planner, trained once, dynamically guides the LLM to generate high-quality, self-reflective reasoning paths, optimizing CoT generation through a rule-based reward mechanism. The planner is plug-and-play across tasks and supports continual learning and generalization. Results: Qwen2.5-0.5B augmented with rSIM significantly outperforms the much larger Qwen2.5-14B on reasoning benchmarks, demonstrating that small-scale LLMs can achieve substantial reasoning gains through reinforcement-learned strategy injection, validating both effectiveness and scalability.
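The "rule-driven reward mechanism" mentioned above can be illustrated with a minimal sketch. The paper does not publish this exact function; the rules, keywords, and weights below are illustrative assumptions, not rSIM's actual reward.

```python
import re

def rule_based_reward(cot: str, answer: str, gold: str) -> float:
    """Score a generated chain of thought with simple verifiable rules.

    Hypothetical example: real rule-based RL rewards typically combine
    answer correctness with format/behavior checks like these.
    """
    reward = 0.0
    # Rule 1: final answer matches the reference (main signal).
    if answer.strip() == gold.strip():
        reward += 1.0
    # Rule 2: the CoT shows reflective reasoning markers.
    if re.search(r"\b(wait|check|re-examine|verify)\b", cot, re.IGNORECASE):
        reward += 0.2
    # Rule 3: penalize empty or trivially short reasoning.
    if len(cot.split()) < 10:
        reward -= 0.5
    return reward
```

The appeal of such rewards, as the abstract notes, is that they are "straightforward" to compute: no learned reward model is needed, only checkable rules.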

📝 Abstract
Large language models (LLMs) are post-trained through reinforcement learning (RL) to evolve into Reasoning Language Models (RLMs), where the hallmark of this advanced reasoning is "aha" moments when they start to perform strategies, such as self-reflection and deep thinking, within chains of thought (CoTs). Motivated by this, this paper proposes a novel reinforced strategy injection mechanism (rSIM) that enables any LLM to become an RLM by employing a small planner to guide the LLM's CoT through the adaptive injection of reasoning strategies. To achieve this, the planner (leader agent) is jointly trained with an LLM (follower agent) using multi-agent RL (MARL), based on a leader-follower framework and straightforward rule-based rewards. Experimental results show that rSIM enables Qwen2.5-0.5B to become an RLM and significantly outperform Qwen2.5-14B. Moreover, the planner is generalizable: it needs to be trained only once and can be applied as a plug-in to substantially improve the reasoning capabilities of existing LLMs. In addition, the planner supports continual learning across various tasks, allowing its planning abilities to gradually improve and generalize to a wider range of problems.
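The leader-follower interaction the abstract describes can be sketched as a simple loop: the planner (leader) picks a strategy, and the LLM (follower) generates the next CoT segment conditioned on it. `Planner` and `llm_generate` below are stand-ins invented for illustration; rSIM's actual planner is a trained RL policy and the follower is a real LLM, not these stubs.

```python
# Strategy vocabulary assumed for illustration; the paper's exact
# strategy set may differ.
STRATEGIES = ["decompose", "self-reflect", "verify", "answer"]

class Planner:
    """Leader agent: chooses the next reasoning strategy to inject."""

    def choose(self, question: str, cot_so_far: list) -> str:
        # A trained planner would act on the question and partial CoT;
        # this stub just walks the strategies in a fixed order.
        return STRATEGIES[min(len(cot_so_far), len(STRATEGIES) - 1)]

def llm_generate(question: str, strategy: str, cot_so_far: list) -> str:
    # Follower agent: in rSIM this is an LLM conditioned on the
    # injected strategy; here we only echo the instruction.
    return f"[{strategy}] step for: {question}"

def solve(question: str, planner: Planner, max_steps: int = 4) -> list:
    """Run the leader-follower loop until the planner emits 'answer'."""
    cot = []
    for _ in range(max_steps):
        strategy = planner.choose(question, cot)
        cot.append(llm_generate(question, strategy, cot))
        if strategy == "answer":  # planner signals termination
            break
    return cot
```

Because the planner sits outside the LLM and only emits strategy tokens, the same trained planner can be plugged into different follower models, which is the portability claim the abstract makes.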
Problem

Research questions and friction points this paper is trying to address.

How to enhance LLMs' reasoning via reinforced strategy injection
How to jointly train a small planner with an LLM using multi-agent reinforcement learning
How to generalize the planner across tasks so it boosts existing LLMs' reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforced strategy injection via a small planner
Multi-agent RL training within a leader-follower framework
Generalizable plug-in planner with continual learning