How much do LLMs learn from negative examples?

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
The role of negative examples -- incorrect, suboptimal, or plausible-but-wrong ("near-miss") responses -- in large language model (LLM) training remains poorly understood. Method: The authors propose a likelihood-ratio (Likra) framework to quantify the impact of negative examples, using multiple-choice question-answering benchmarks to control both the influence and the volume of negatives relative to positive-only supervised fine-tuning (SFT). Contribution/Results: They find that negative examples, especially near-misses, produce an abrupt, step-like jump in the learning curve during a critical training phase, with markedly larger per-example gains than positive-only SFT; and that while positive-only training fails to lower the likelihood of plausible-but-incorrect answers, training with negative examples identifies and suppresses them. These results point to a significant role for negative examples in improving accuracy and reducing hallucinations, and suggest more sample-efficient alignment strategies.

📝 Abstract
Large language models (LLMs) undergo a three-phase training process: unsupervised pre-training, supervised fine-tuning (SFT), and learning from human feedback (RLHF/DPO). Notably, it is during the final phase that these models are exposed to negative examples -- incorrect, rejected, or suboptimal responses to queries. This paper delves into the role of negative examples in the training of LLMs, using a likelihood-ratio (Likra) model on multiple-choice question answering benchmarks to precisely manage the influence and the volume of negative examples. Our findings reveal three key insights: (1) During a critical phase in training, Likra with negative examples demonstrates a significantly larger improvement per training example compared to SFT using only positive examples. This leads to a sharp jump in the learning curve for Likra unlike the smooth and gradual improvement of SFT; (2) negative examples that are plausible but incorrect (near-misses) exert a greater influence; and (3) while training with positive examples fails to significantly decrease the likelihood of plausible but incorrect answers, training with negative examples more accurately identifies them. These results indicate a potentially significant role for negative examples in improving accuracy and reducing hallucinations for LLMs.
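The likelihood-ratio idea can be sketched at inference time: score each multiple-choice option by the difference between its log-probability under a model tuned on positive (correct) examples and under a head tuned on negative examples, so that near-misses the negative head has learned to expect are demoted. The function names and toy log-probabilities below are illustrative assumptions, not taken from the paper:

```python
def likra_score(logp_pos, logp_neg):
    """Likelihood-ratio score: log p_pos(answer | q) - log p_neg(answer | q).

    logp_pos: log-probability of the answer under a model fine-tuned on
    positive (correct) responses; logp_neg: under a head fine-tuned on
    negative (incorrect) responses.
    """
    return logp_pos - logp_neg

def pick_answer(candidates):
    """Select the option with the highest likelihood-ratio score.

    candidates: list of (option, logp_pos, logp_neg) tuples.
    """
    return max(candidates, key=lambda c: likra_score(c[1], c[2]))[0]

# Toy example: option B is a near-miss. The positive model likes it
# almost as much as the correct answer A, but the negative head (trained
# on plausible-but-wrong responses) also assigns it high likelihood,
# so the ratio demotes it.
candidates = [
    ("A", -2.0, -6.0),  # correct answer: likely under pos, unlikely under neg
    ("B", -2.2, -2.5),  # near-miss: likely under both heads
    ("C", -7.0, -7.5),  # implausible distractor: unlikely under both
]
print(pick_answer(candidates))  # -> A
```

Under the positive model alone, A and B are nearly tied; the ratio separates them cleanly, which mirrors the paper's finding that positive-only training does not push down the likelihood of plausible-but-incorrect answers.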
Problem

Research questions and friction points this paper is trying to address.

Role of negative examples in LLM training
Impact of plausible but incorrect negative examples
Improving accuracy and reducing hallucinations in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses likelihood-ratio model for negative examples
Focuses on plausible but incorrect near-misses
Lowers the likelihood assigned to plausible but incorrect answers, reducing hallucinations