🤖 AI Summary
This work proposes Meta-Experience Learning (MEL), a reinforcement learning framework that brings meta-experience, a concept inspired by the human learning cycle, into large language models to address fine-grained credit assignment and the internalization of reusable knowledge in reasoning tasks. MEL uses the model's self-verification capability to contrast paired correct and incorrect reasoning trajectories, locate the bifurcation points where errors arise, and distill these insights into generalizable meta-experiences. The meta-experiences are then internalized into the model's parameters by minimizing the negative log-likelihood, which induces a language-modeled reward signal linking correct and incorrect trajectories. Experiments show consistent improvements across multiple benchmarks, with Pass@1 gains of 3.92% to 4.73% across models of varying scales.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for enhancing the reasoning capabilities of Large Language Models (LLMs). Despite its efficacy, RLVR faces a meta-learning bottleneck: beyond practice and verification, it lacks the mechanisms for error attribution and experience internalization that are intrinsic to the human learning cycle, which limits fine-grained credit assignment and the formation of reusable knowledge. We term the reusable knowledge representations derived from past errors meta-experience. Based on this insight, we propose Meta-Experience Learning (MEL), a novel framework that incorporates self-distilled meta-experience into the model's parametric memory. Building on standard RLVR, MEL adds a step that leverages the LLM's self-verification capability to conduct contrastive analysis on paired correct and incorrect trajectories, identify the precise bifurcation points where reasoning errors arise, and summarize them into generalizable meta-experience. The meta-experience is then internalized into the LLM's parametric memory by minimizing the negative log-likelihood, which induces a language-modeled reward signal that bridges correct and incorrect reasoning trajectories and facilitates effective knowledge reuse. Experimental results demonstrate that MEL achieves consistent improvements on benchmarks, yielding Pass@1 gains of 3.92% to 4.73% across varying model sizes.
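To make the data flow described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes rollouts arrive as `(text, reward)` tuples with a binary verifier reward, and that per-token log-probabilities of a meta-experience summary are already available. All function names are hypothetical. In particular, `first_divergence` is a crude string-comparison stand-in: the abstract says the LLM itself identifies bifurcation points through self-verification, which this helper does not attempt to reproduce.

```python
import math
from itertools import product


def pair_trajectories(rollouts):
    """Pair each verified-correct rollout with each incorrect one.

    `rollouts` is a list of (text, reward) tuples for a single prompt, where
    reward == 1 means the verifier accepted the answer (assumed convention).
    """
    correct = [text for text, reward in rollouts if reward == 1]
    incorrect = [text for text, reward in rollouts if reward == 0]
    return list(product(correct, incorrect))


def first_divergence(correct_steps, incorrect_steps):
    """Index of the first step where two step lists differ.

    A simplistic placeholder for bifurcation-point detection; the paper
    relies on the LLM's own contrastive analysis instead.
    """
    for i, (a, b) in enumerate(zip(correct_steps, incorrect_steps)):
        if a != b:
            return i
    # One list is a prefix of the other: they diverge where the shorter ends.
    return min(len(correct_steps), len(incorrect_steps))


def negative_log_likelihood(token_logprobs):
    """Token-averaged negative log-likelihood of a meta-experience summary."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return -sum(token_logprobs) / len(token_logprobs)
```

In an actual training loop, the value returned by `negative_log_likelihood` would be computed by the model over the meta-experience tokens and combined with the standard RLVR objective; how the two terms are weighted is not stated in the abstract.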