🤖 AI Summary
Language models often incur substantial computational cost when they use reward signals to improve their reasoning, because both parametric and non-parametric methods typically need hundreds of training samples and thousands of model rollouts. This work proposes Contrastive Reflection (CORE), which contrasts successful and failed reasoning traces to extract concise, interpretable natural-language insights that guide self-improvement. CORE is a non-parametric learning algorithm: it stores what it learns as compact, context-efficient insights rather than updating model weights, optimizing prompts, or directly reusing stored traces. Across four reasoning tasks, CORE improves more rapidly than GRPO, GEPA, episodic RAG, and MemRL while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, it matches or exceeds the gains of each baseline, and it requires fewer prompt tokens than the non-parametric baselines.
📝 Abstract
Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, we then show that CORE also achieves comparable or greater performance gains than each baseline. Finally, we highlight how CORE is also substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.
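The contrast-then-abstract idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`Trace`, `InsightMemory`, `contrast_pairs`, `update_memory`, `toy_reflect`) is hypothetical, the pairing rule and the fixed-capacity memory are assumptions, and `toy_reflect` is a string-formatting stand-in for what would presumably be a language-model call that writes the insight.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Trace:
    problem: str
    reasoning: str
    reward: float  # verifiable reward: 1.0 if the answer checked out, else 0.0


def contrast_pairs(traces: List[Trace]) -> List[Tuple[Trace, Trace]]:
    """Pair every successful trace with every failed trace on the same problem."""
    by_problem: Dict[str, List[Trace]] = {}
    for trace in traces:
        by_problem.setdefault(trace.problem, []).append(trace)
    pairs = []
    for group in by_problem.values():
        wins = [t for t in group if t.reward > 0]
        losses = [t for t in group if t.reward <= 0]
        pairs.extend((w, l) for w in wins for l in losses)
    return pairs


@dataclass
class InsightMemory:
    """Bounded list of short insights (assumed design, oldest entries dropped first)."""
    capacity: int = 8
    insights: List[str] = field(default_factory=list)

    def add(self, insight: str) -> None:
        if insight and insight not in self.insights:
            self.insights.append(insight)
        while len(self.insights) > self.capacity:
            self.insights.pop(0)

    def render(self, problem: str) -> str:
        """Build a prompt that carries only the insights, not the raw traces."""
        bullets = "\n".join(f"- {text}" for text in self.insights)
        return f"Insights from earlier attempts:\n{bullets}\n\nProblem: {problem}"


def toy_reflect(success: Trace, failure: Trace) -> str:
    """Placeholder for a model call that would describe the difference in words."""
    return f"Prefer '{success.reasoning}' over '{failure.reasoning}'."


def update_memory(
    memory: InsightMemory,
    traces: List[Trace],
    reflect: Callable[[Trace, Trace], str],
) -> None:
    """Turn each success/failure contrast into one stored insight."""
    for success, failure in contrast_pairs(traces):
        memory.add(reflect(success, failure))


traces = [
    Trace("17 * 6", "split into 10*6 + 7*6", 1.0),
    Trace("17 * 6", "round 17 up to 20", 0.0),
    Trace("9 + 5", "count on from 9", 1.0),  # no failed attempt, so no contrast
]
memory = InsightMemory(capacity=4)
update_memory(memory, traces, toy_reflect)
print(memory.render("23 * 4"))
```

In this sketch the prompt for a new problem carries only the stored insights, which is one way to read the abstract's point about context efficiency compared with replaying full reasoning traces.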