CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning

📅 2025-12-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Large language models (LLMs) often produce correct answers to mathematical problems without genuinely understanding or transferring the underlying concepts. Existing reinforcement learning with verifiable rewards (RLVR) methods supervise only final answers and so lack fine-grained conceptual guidance. Method: We propose CORE, a concept-oriented reinforcement learning framework with three components: (1) concept-aligned quiz synthesis, (2) concept snippet injection into reasoning traces, and (3) trajectory replacement after group failures, used as a regularizer. Together these close the training loop from conceptual mastery to application. CORE integrates concept-injected prompting, trajectory-level forward KL regularization, GRPO optimization, and concept–problem association modeling, and it supports diverse LLMs and verifiers. Contribution/Results: On domain-specific concept evaluation suites and cross-domain benchmarks (MATH, AMC), CORE significantly outperforms supervised fine-tuning (SFT) and state-of-the-art RL baselines, providing controllable, concept-level supervision and improved generalization.

📝 Abstract

Large language models (LLMs) often solve challenging math exercises yet fail to apply the underlying concept correctly when the problem requires genuine understanding. Popular Reinforcement Learning with Verifiable Rewards (RLVR) pipelines reinforce final answers but provide little fine-grained conceptual signal, so models improve at pattern reuse rather than conceptual application. We introduce CORE (Concept-Oriented REinforcement), an RL training framework that turns explicit concepts into a controllable supervision signal. Starting from a high-quality, low-contamination textbook resource that links verifiable exercises to concise concept descriptions, we run a sanity probe showing that LLMs can restate definitions but fail concept-linked quizzes, which quantifies the conceptual reasoning gap. CORE then (i) synthesizes concept-aligned quizzes, (ii) injects brief concept snippets during rollouts to elicit concept-primed trajectories, and (iii) reinforces conceptual reasoning through one of three mechanisms: trajectory replacement after group failures, a lightweight forward-KL constraint that aligns the unguided policy with the concept-primed policy, or standard GRPO applied directly to concept-aligned quizzes. Across several models, CORE delivers consistent gains over vanilla and SFT baselines on both in-domain concept-exercise suites and diverse out-of-domain math benchmarks. CORE unifies direct training on concept-aligned quizzes and concept-injected rollouts under outcome regularization. It provides fine-grained conceptual supervision that bridges problem-solving competence and genuine conceptual reasoning, while remaining algorithm- and verifier-agnostic.
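
The trajectory-replacement step in (iii) can be illustrated with a small sketch. The abstract does not give the exact replacement rule, so everything below is an assumption made for illustration: rollouts are `(text, reward)` pairs, a group "fails" when no unguided rollout earns a positive reward, exactly one failed rollout is swapped for a successful concept-primed one, and advantages use the usual GRPO-style group normalization. The function names `group_advantages` and `replace_on_group_failure` are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def replace_on_group_failure(unguided, primed):
    """Return the training group of (text, reward) rollouts.

    Assumed rule (not taken from the paper): if every unguided rollout
    failed, swap the last one for the first successful concept-primed
    rollout, if any exists. Otherwise the group is returned unchanged.
    """
    if any(reward > 0 for _, reward in unguided):
        return list(unguided)
    winners = [rollout for rollout in primed if rollout[1] > 0]
    if not winners:
        return list(unguided)
    return list(unguided[:-1]) + [winners[0]]


# Usage: an all-failure group gains a learning signal after replacement.
unguided = [("a", 0.0), ("b", 0.0), ("c", 0.0)]
primed = [("p", 0.0), ("q", 1.0)]
group = replace_on_group_failure(unguided, primed)
advantages = group_advantages([reward for _, reward in group])
```

Without replacement, an all-zero reward group yields all-zero advantages and therefore no gradient signal; after the swap, the concept-primed success receives a positive advantage and the remaining failures a negative one.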

Problem

Research questions and friction points this paper is trying to address.

Bridges the gap between definition memorization and genuine concept application in mathematical reasoning.

Provides fine-grained conceptual supervision to enhance models' understanding beyond pattern reuse.

Unifies concept-aligned quiz training and concept-injected rollouts under outcome regularization.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Concept-aligned quiz synthesis for fine-grained supervision

Concept snippet injection during rollouts to prime trajectories

Trajectory replacement after group failures, or a forward-KL constraint, to align the unguided policy with the concept-primed policy
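
As a rough illustration of the snippet-injection and forward-KL ideas listed above, the sketch below builds a concept-primed prompt and computes a forward KL divergence between per-step token distributions. The prompt template, the direction of the divergence (concept-primed distribution as the target, unguided as the approximation), the averaging over steps, and the names `inject_concept`, `forward_kl`, and `trajectory_forward_kl` are all assumptions for illustration, not details confirmed by the source.

```python
import math


def inject_concept(problem, concept):
    """Prepend a brief concept snippet to the problem (hypothetical template)."""
    return f"Concept: {concept}\n\nProblem: {problem}"


def forward_kl(p_primed, q_unguided, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) over one token distribution.

    Assumption: p is the concept-primed distribution (target) and q is the
    unguided distribution being pulled toward it.
    """
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(p_primed, q_unguided)
        if p > 0
    )


def trajectory_forward_kl(primed_steps, unguided_steps):
    """Average the per-step forward KL over a trajectory of distributions."""
    pairs = list(zip(primed_steps, unguided_steps))
    if not pairs:
        return 0.0
    return sum(forward_kl(p, q) for p, q in pairs) / len(pairs)


# Usage: identical distributions give zero penalty, mismatched ones a positive one.
prompt = inject_concept("Find gcd(12, 18).", "The gcd is the largest common divisor.")
same = trajectory_forward_kl([[0.5, 0.5]], [[0.5, 0.5]])
diff = trajectory_forward_kl([[0.9, 0.1]], [[0.5, 0.5]])
```

In an actual training loop the distributions would come from the model's logits on the same response tokens with and without the injected snippet, and the penalty would be added to the RL objective with a small weight.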

🔎 Similar Papers

BloomWise: Enhancing Problem-Solving capabilities of Large Language Models using Bloom's-Taxonomy-Inspired Prompts