Hard Negative Sample-Augmented DPO Post-Training for Small Language Models

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) frequently generate mathematically flawed reasoning—plausible yet erroneous outputs exhibiting logical, algebraic, or numerical inconsistencies—while existing post-training methods (e.g., binary correctness classification or RLHF) suffer from high computational cost, poor scalability, and neglect of fine-grained error structure. Method: We propose a lightweight post-training framework: (1) constructing a six-dimensional structured error profile (e.g., logical, algebraic, numerical), annotated via a compact MathVerifier combining symbolic rules and heuristics; (2) mining “nearly correct but defective” hard negative examples; and (3) designing a verifier-guided weighted DPO objective that dynamically weights preference pairs by error severity. Contribution/Results: Our approach requires no large reward model or LLM-as-judge. On Qwen2.5-1.5B, it significantly improves accuracy on problems where answers are numerically close but logically inconsistent—outperforming standard SFT and unweighted DPO—with controllable training overhead.
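The verifier-guided weighted DPO objective described above can be sketched as a per-pair loss: standard DPO scaled by a verifier-derived importance weight. The paper's exact weighting scheme is not given here, so the weight argument and the default `beta` below are illustrative assumptions.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def weighted_dpo_loss(logp_w: float, logp_l: float,
                      ref_logp_w: float, ref_logp_l: float,
                      weight: float, beta: float = 0.1) -> float:
    """Per-pair weighted DPO loss (sketch).

    logp_w / logp_l     -- policy log-probs of the chosen / rejected response
    ref_logp_w / ref_logp_l -- reference-model log-probs of the same responses
    weight              -- verifier-derived importance weight (hypothetical w_i,
                           e.g. derived from error severity)
    """
    # Implicit reward margin between chosen and rejected, relative to the reference.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Standard DPO negative log-sigmoid loss, scaled by the per-pair weight.
    return -weight * math.log(sigmoid(margin))
```

With `weight = 1.0` this reduces to vanilla DPO; severity-based weights then emphasize the most informative preference pairs without changing the optimization machinery.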

📝 Abstract
Large language models (LLMs) continue to struggle with mathematical reasoning, and common post-training pipelines often reduce each generated solution to a binary outcome: correct or incorrect. This perspective is limiting in practice, as failures in chain-of-thought (CoT) reasoning are frequently structured; solutions may appear convincing while containing subtle logical, algebraic, or numerical flaws. Meanwhile, reinforcement learning from human feedback (RLHF) variants that rely on large reward models or LLM-as-a-judge signals are often expensive, difficult to scale, and unstable to iterate. We propose a lightweight and pragmatic post-training pipeline that targets such structured errors under realistic compute budgets. Starting from supervised fine-tuning (SFT) on MetaMathQA-style CoT data, we introduce a compact MathVerifier that decomposes a candidate solution into a six-dimensional error profile and aggregates it into interpretable wrongness and absurdity scores. These verifier signals serve two roles: (i) mining hard negatives that are near-correct yet structurally flawed, and (ii) defining per-sample importance weights that emphasize the most informative preference pairs. We integrate both into an offline Direct Preference Optimization (DPO) objective via a verifier-guided weighted formulation. Experiments on a 1.5B-parameter Qwen2.5 model show that verifier-guided, weighted DPO yields more targeted improvements than vanilla SFT and unweighted DPO, particularly on problems where solutions are numerically close to correct but logically inconsistent, while avoiding the overhead of training large reward models or relying on external judges.
Problem

Research questions and friction points this paper is trying to address.

Improves mathematical reasoning in small language models
Targets structured errors in chain-of-thought reasoning
Avoids expensive reward models or external judges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compact MathVerifier decomposes solutions into error profiles
Hard negative mining targets near-correct but flawed solutions
Verifier-guided weighted DPO emphasizes informative preference pairs
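The hard-negative mining step in the bullets above can be sketched as a filter: keep candidates whose final answer is numerically close to the gold answer but whose verifier wrongness score is high. The tolerance and wrongness threshold below are illustrative assumptions, not the paper's values.

```python
def mine_hard_negatives(candidates: list[tuple[float, float, float]],
                        tol: float = 1e-2,
                        min_wrongness: float = 0.3) -> list[tuple[float, float, float]]:
    """Select 'nearly correct but defective' negatives (sketch).

    Each candidate is (predicted_answer, gold_answer, wrongness), where
    wrongness is the verifier's aggregated severity score. A hard negative is
    numerically near the gold answer yet structurally flawed.
    """
    hard = []
    for pred, gold, wrongness in candidates:
        # Relative closeness check, with an absolute floor for small answers.
        near_correct = abs(pred - gold) <= tol * max(1.0, abs(gold))
        if near_correct and wrongness >= min_wrongness:
            hard.append((pred, gold, wrongness))
    return hard
```

Pairing these mined negatives with verified-correct solutions yields the preference pairs that the weighted DPO objective then emphasizes.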