Internalizing Meta-Experience into Memory for Guided Reinforcement Learning in Large Language Models

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes Meta-Experience Learning (MEL), a novel reinforcement learning framework that introduces the concept of meta-experience—inspired by human learning—into large language models to address the challenges of fine-grained credit assignment and internalization of reusable knowledge in reasoning tasks. MEL leverages a self-verification mechanism to contrast correct and incorrect reasoning trajectories, identifies error-inducing decision points, and distills these insights into generalizable meta-experiences. These meta-experiences are then internalized into the model’s parameters through a unified approach combining self-distillation, contrastive analysis, and language modeling–based reward signals. Experimental results demonstrate consistent and significant performance gains across multiple benchmarks, with Pass@1 accuracy improvements ranging from 3.92% to 4.73% across models of varying scales.
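The contrastive step described above, locating the decision point where a correct and an incorrect trajectory diverge, can be sketched in a few lines. In MEL this comparison is performed by the LLM's own self-verification rather than literal matching, so the helper below (name and signature are illustrative, not from the paper) only conveys the idea on step-level traces:

```python
def first_bifurcation(correct_steps, incorrect_steps):
    """Index of the first step where an incorrect reasoning trajectory
    departs from its paired correct trajectory.

    Illustrative only: the paper's contrastive analysis is done by the
    model's self-verification over natural-language traces, not by
    exact step comparison.
    """
    for i, (good, bad) in enumerate(zip(correct_steps, incorrect_steps)):
        if good != bad:
            return i
    # One trace is a prefix of the other; divergence is at the shorter length.
    return min(len(correct_steps), len(incorrect_steps))


# Toy usage: both traces agree on the setup, then the incorrect one
# applies the wrong operation at step 2.
good = ["parse problem", "set up equation", "solve for x", "check answer"]
bad = ["parse problem", "set up equation", "drop the constant term"]
print(first_bifurcation(good, bad))
```

The returned index marks the error-inducing decision point from which a generalizable meta-experience would be summarized.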

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for enhancing the reasoning capabilities of Large Language Models (LLMs). Despite its efficacy, RLVR faces a meta-learning bottleneck: beyond practice and verification, it lacks the mechanisms for error attribution and experience internalization intrinsic to the human learning cycle, which limits fine-grained credit assignment and the formation of reusable knowledge. We term such reusable knowledge representations derived from past errors meta-experience. Based on this insight, we propose Meta-Experience Learning (MEL), a novel framework that incorporates self-distilled meta-experience into the model's parametric memory. Building upon standard RLVR, we introduce an additional design that leverages the LLM's self-verification capability to conduct contrastive analysis on paired correct and incorrect trajectories, identify the precise bifurcation points where reasoning errors arise, and summarize them into generalizable meta-experience. The meta-experience is further internalized into the LLM's parametric memory by minimizing the negative log-likelihood, which induces a language-modeled reward signal that bridges correct and incorrect reasoning trajectories and facilitates effective knowledge reuse. Experimental results demonstrate that MEL achieves consistent improvements on benchmarks, yielding 3.92%--4.73% Pass@1 gains across varying model sizes.
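The internalization objective in the abstract, minimizing negative log-likelihood over the distilled meta-experience text, is the standard language-modeling loss. A minimal sketch of that quantity, with per-token probabilities standing in for model outputs (the function name and inputs are assumptions for illustration; the paper does not publish an implementation):

```python
import math


def nll_internalization_loss(token_probs):
    """Average negative log-likelihood of the meta-experience tokens.

    token_probs: probabilities the model assigns to each token of the
    self-distilled meta-experience text. Minimizing this loss nudges
    the model's parameters toward reproducing the meta-experience,
    acting as the language-modeled reward signal described above.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# A model that is already confident on the meta-experience tokens
# incurs a lower internalization loss than an uncertain one.
confident = [0.9, 0.8, 0.95]
uncertain = [0.2, 0.1, 0.3]
assert nll_internalization_loss(confident) < nll_internalization_loss(uncertain)
```

In a full training loop this term would be combined with the standard RLVR objective; the exact weighting between the two is not specified in this summary.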
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Large Language Models
Meta-Experience
Error Attribution
Knowledge Reuse
Innovation

Methods, ideas, or system contributions that make the work stand out.

Meta-Experience Learning
Reinforcement Learning with Verifiable Rewards
experience internalization
contrastive trajectory analysis
parametric memory