Subtle Errors Matter: Preference Learning via Error-injected Self-editing

📅 2024-10-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) face performance bottlenecks in mathematical reasoning due to subtle errors, e.g., arithmetic miscalculations or symbol missubstitutions, that existing preference learning methods struggle to detect and correct. To address this, the authors propose eRror-Injected Self-Editing (RISE), a preference learning framework in which the LLM itself injects controlled, predefined errors by perturbing pivotal tokens in its own solutions. This yields two complementary kinds of hard-to-distinguish preference pairs, (correct, sampled-incorrect) and (correct, self-edited), enabling fine-grained error awareness. The method requires no human preference annotation or fine-grained sampling and directly drives Direct Preference Optimization (DPO) training. With only 4.5K training samples, it improves Qwen2-7B-Instruct accuracy by 3.0% on GSM8K and 7.9% on MATH, and the error-mitigation effect generalizes to logical reasoning and code generation.

📝 Abstract
Large Language Models (LLMs) have exhibited strong mathematical reasoning prowess, tackling tasks ranging from basic arithmetic to advanced competition-level problems. However, frequently occurring subtle yet critical errors, such as miscalculations or incorrect substitutions, limit the LLMs' full potential. Existing studies to improve mathematical ability typically involve applying preference learning to step-wise solution pairs. Although these methods leverage samples of varying granularity to mitigate reasoning errors, they overlook critical subtle errors. In this work, we propose a novel preference learning framework called eRror-Injected Self-Editing (RISE), which injects predefined subtle errors into pivotal tokens in reasoning or computation steps to construct hard pairs for error mitigation. In detail, RISE uses the LLM itself to edit a small number of tokens in the solution, injecting designed subtle errors. Then, pairs composed of self-edited solutions and their corresponding correct ones, along with pairs of correct and incorrect solutions obtained through sampling, are used together for subtle error-aware DPO training. Compared with other preference learning methods, RISE further refines the training objective without requiring fine-grained sampling or preference annotation. Extensive experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples. Moreover, the effect of error mitigation extends from mathematical reasoning to logical reasoning and code generation.
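The abstract describes DPO training over two kinds of pairs: (correct, sampled-incorrect) and (correct, self-edited). A minimal sketch of that setup is below; the pair strings, log-probability values, and `beta` are illustrative placeholders, not values from the paper, and the loss shown is the standard DPO objective rather than RISE's refined variant:

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))."""
    logits = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# RISE trains on two kinds of hard pairs (examples are illustrative):
pairs = [
    # 1) correct solution vs. an incorrect solution obtained by sampling
    {"chosen": "3 boxes * 12 pens = 36 pens", "rejected": "3 boxes * 12 pens = 38 pens"},
    # 2) correct solution vs. a self-edited copy with an injected subtle error
    {"chosen": "3 boxes * 12 pens = 36 pens", "rejected": "3 boxes * 12 pens = 63 pens"},
]

# Toy sequence log-probabilities: the loss shrinks as the policy prefers
# the chosen solution more strongly than the reference model does.
loss = dpo_loss(policy_chosen_lp=-5.0, policy_rejected_lp=-9.0,
                ref_chosen_lp=-6.0, ref_rejected_lp=-6.5)
```

Because the rejected side of each pair differs from the chosen side by only a few tokens, the gradient concentrates on exactly the subtle-error positions the paper targets.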
Problem

Research questions and friction points this paper is trying to address.

Addresses subtle errors in LLM reasoning
Enhances mathematical problem-solving in LLMs
Proposes error-injected self-editing for error mitigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

eRror-Injected Self-Editing (RISE) framework
Self-edited solutions for training
Subtle error-aware DPO training
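In the paper, the LLM itself edits a small number of pivotal tokens to inject a designed subtle error. The toy function below is a simplified programmatic stand-in for that self-editing step (the real method uses LLM prompting, and `inject_subtle_error` and its perturbation rule are hypothetical):

```python
import random
import re

def inject_subtle_error(solution: str, rng: random.Random) -> str:
    """Toy stand-in for RISE's self-editing step: perturb one pivotal
    numeric token so the solution stays fluent but becomes wrong."""
    numbers = list(re.finditer(r"\d+", solution))
    if not numbers:
        return solution  # nothing to perturb
    target = rng.choice(numbers)
    value = int(target.group())
    perturbed = value + rng.choice([-2, -1, 1, 2])  # small miscalculation
    return solution[:target.start()] + str(perturbed) + solution[target.end():]

rng = random.Random(0)
correct = "Each box holds 12 pens, so 3 boxes hold 12 * 3 = 36 pens."
edited = inject_subtle_error(correct, rng)
# `edited` differs from `correct` by a single perturbed number, forming the
# rejected side of a hard-to-distinguish preference pair for DPO training.
```

Editing only a few tokens is the point: the pair is nearly identical except at the error site, which forces the model to become sensitive to exactly the miscalculations and missubstitutions the paper identifies.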