🤖 AI Summary
The role of negative examples, such as incorrect, suboptimal, or plausible-but-wrong ("near-miss") responses, in large language model (LLM) alignment training remains poorly understood.
Method: We propose a likelihood-ratio (Likra) model, evaluated on multiple-choice QA benchmarks, that precisely controls the influence and volume of negative examples and thereby isolates their contribution from that of the positive examples used in supervised fine-tuning (SFT) and the preference data used in RLHF/direct preference optimization (DPO). A hedged sketch of one possible scoring rule follows this summary.
Contribution/Results: Negative examples, especially near-miss ones, induce abrupt, step-like jumps in the learning curve and yield significantly larger per-example gains than positive-only SFT during a critical training phase; they also reduce the likelihood of plausible-but-incorrect outputs, which positive-only training fails to do, pointing to an important role in suppressing hallucination. Together these results give empirical grounding for more efficient and robust alignment strategies that exploit negative examples.
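One natural instantiation of the likelihood-ratio idea (a sketch under our own assumptions, not necessarily the paper's exact formulation) trains one model head on positive examples and a second head on negative examples, then scores a candidate answer $a$ to a question $q$ by the log-likelihood ratio, with a weight $\lambda$ controlling how strongly the negatives count:

$$
s(a \mid q) \;=\; \log P_{\theta^{+}}(a \mid q) \;-\; \lambda \, \log P_{\theta^{-}}(a \mid q)
$$

Setting $\lambda = 0$ recovers positive-only SFT scoring; raising $\lambda$, or the number of negative examples seen by $\theta^{-}$, dials up their influence, which is the kind of control the experiments rely on.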
📝 Abstract
Large language models (LLMs) undergo a three-phase training process: unsupervised pre-training, supervised fine-tuning (SFT), and learning from human feedback (RLHF/DPO). Notably, it is during the final phase that these models are exposed to negative examples -- incorrect, rejected, or suboptimal responses to queries. This paper delves into the role of negative examples in the training of LLMs, using a likelihood-ratio (Likra) model on multiple-choice question answering benchmarks to precisely manage the influence and the volume of negative examples. Our findings reveal three key insights: (1) During a critical phase in training, Likra with negative examples demonstrates a significantly larger improvement per training example compared to SFT using only positive examples. This leads to a sharp jump in the learning curve for Likra, unlike the smooth and gradual improvement of SFT; (2) negative examples that are plausible but incorrect (near-misses) exert a greater influence; and (3) while training with positive examples fails to significantly decrease the likelihood of plausible but incorrect answers, a model trained with negative examples identifies them more accurately. These results indicate a potentially significant role for negative examples in improving accuracy and reducing hallucinations for LLMs.
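To make the scoring mechanism concrete, below is a minimal Python sketch of Likra-style multiple-choice scoring. It is a sketch under stated assumptions: stock GPT-2 checkpoints stand in for the positive and negative heads (which per the abstract would be fine-tuned on positive and negative examples, respectively), and the helper names (`answer_logprob`, `likra_score`), the `lam` weight, and the prompt format are illustrative, not the paper's recipe.

```python
# Hedged sketch of likelihood-ratio (Likra-style) scoring for multiple-choice QA.
# Assumption: `pos_model` would be fine-tuned on correct answers and `neg_model`
# on negative (incorrect / near-miss) answers; stock GPT-2 stands in for both here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # placeholder base model
pos_model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for the positive head
neg_model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for the negative head
pos_model.eval()
neg_model.eval()

@torch.no_grad()
def answer_logprob(model, question: str, answer: str) -> float:
    """Sum of log-probabilities of the answer tokens given the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    # Assumes appending " answer" does not change the prompt's tokenization
    # (true for space-separated GPT-2 BPE in this example).
    full_ids = tokenizer(question + " " + answer, return_tensors="pt").input_ids
    logits = model(full_ids).logits                      # (1, seq_len, vocab)
    log_probs = logits.log_softmax(dim=-1)
    start = prompt_ids.shape[1]
    # Logits at position t predict token t+1, so shift by one.
    target = full_ids[0, start:]
    token_lp = log_probs[0, start - 1 : full_ids.shape[1] - 1].gather(
        1, target.unsqueeze(1)
    )
    return token_lp.sum().item()

def likra_score(question: str, answer: str, lam: float = 1.0) -> float:
    """Likelihood-ratio score: high when the positive head likes the answer
    and the negative head does not. `lam` controls the influence of negatives."""
    return answer_logprob(pos_model, question, answer) - lam * answer_logprob(
        neg_model, question, answer
    )

# Multiple-choice evaluation: pick the candidate with the highest ratio score.
question = "What is the capital of Australia?"
choices = ["Canberra", "Sydney", "Melbourne"]   # "Sydney" is a near-miss
print(max(choices, key=lambda c: likra_score(question, c)))
```

Because the negative head is trained to assign high likelihood to near-misses (e.g., "Sydney" above), subtracting its log-probability specifically pushes down plausible-but-wrong candidates, which is insight (3) of the abstract in miniature.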