🤖 AI Summary
The role of negative examples, such as incorrect, suboptimal, or plausible-but-wrong ("near-miss") responses, in large language model (LLM) alignment training remains poorly understood.
Method: We propose a likelihood-ratio (Likra) model, evaluated on multiple-choice QA benchmarks, that precisely controls the influence and volume of negative examples and thereby isolates their contribution from that of the positive examples used in supervised fine-tuning (SFT) and the preference data used in RLHF/direct preference optimization (DPO). A hedged sketch of one possible scoring rule follows this summary.
Contribution/Results: Negative examples, especially near-miss ones, induce abrupt, step-like jumps in the learning curve and yield significantly larger per-example gains than positive-only SFT during a critical training phase; they also reduce the likelihood of plausible-but-incorrect outputs, which positive-only training fails to do, pointing to an important role in suppressing hallucination. Together these results give empirical grounding for more efficient and robust alignment strategies that exploit negative examples.
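One natural instantiation of the likelihood-ratio idea (a sketch under our own assumptions, not necessarily the paper's exact formulation) trains one model head on positive examples and a second head on negative examples, then scores a candidate answer $a$ to a question $q$ by the log-likelihood ratio, with a weight $\lambda$ controlling how strongly the negatives count:

$$
s(a \mid q) \;=\; \log P_{\theta^{+}}(a \mid q) \;-\; \lambda \, \log P_{\theta^{-}}(a \mid q)
$$

Setting $\lambda = 0$ recovers positive-only SFT scoring; raising $\lambda$, or the number of negative examples seen by $\theta^{-}$, dials up their influence, which is the kind of control the experiments rely on.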
📝 Abstract
Large language models (LLMs) undergo a three-phase training process: unsupervised pre-training, supervised fine-tuning (SFT), and learning from human feedback (RLHF/DPO). Notably, it is during the final phase that these models are exposed to negative examples -- incorrect, rejected, or suboptimal responses to queries. This paper delves into the role of negative examples in the training of LLMs, using a likelihood-ratio (Likra) model on multiple-choice question answering benchmarks to precisely manage the influence and the volume of negative examples. Our findings reveal three key insights: (1) During a critical phase in training, Likra with negative examples demonstrates a significantly larger improvement per training example compared to SFT using only positive examples. This leads to a sharp jump in the learning curve for Likra, unlike the smooth and gradual improvement of SFT; (2) negative examples that are plausible but incorrect (near-misses) exert a greater influence; and (3) while training with positive examples fails to significantly decrease the likelihood of plausible but incorrect answers, a model trained with negative examples identifies them more accurately. These results indicate a potentially significant role for negative examples in improving accuracy and reducing hallucinations for LLMs.
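To make the scoring mechanism concrete, below is a minimal Python sketch of Likra-style multiple-choice scoring. It is a sketch under stated assumptions: stock GPT-2 checkpoints stand in for the positive and negative heads (which per the abstract would be fine-tuned on positive and negative examples, respectively), and the helper names (`answer_logprob`, `likra_score`), the `lam` weight, and the prompt format are illustrative, not the paper's recipe.

```python
# Hedged sketch of likelihood-ratio (Likra-style) scoring for multiple-choice QA.
# Assumption: `pos_model` would be fine-tuned on correct answers and `neg_model`
# on negative (incorrect / near-miss) answers; stock GPT-2 stands in for both here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # placeholder base model
pos_model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for the positive head
neg_model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for the negative head
pos_model.eval()
neg_model.eval()

@torch.no_grad()
def answer_logprob(model, question: str, answer: str) -> float:
    """Sum of log-probabilities of the answer tokens given the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    # Assumes appending " answer" does not change the prompt's tokenization
    # (true for space-separated GPT-2 BPE in this example).
    full_ids = tokenizer(question + " " + answer, return_tensors="pt").input_ids
    logits = model(full_ids).logits                      # (1, seq_len, vocab)
    log_probs = logits.log_softmax(dim=-1)
    start = prompt_ids.shape[1]
    # Logits at position t predict token t+1, so shift by one.
    target = full_ids[0, start:]
    token_lp = log_probs[0, start - 1 : full_ids.shape[1] - 1].gather(
        1, target.unsqueeze(1)
    )
    return token_lp.sum().item()

def likra_score(question: str, answer: str, lam: float = 1.0) -> float:
    """Likelihood-ratio score: high when the positive head likes the answer
    and the negative head does not. `lam` controls the influence of negatives."""
    return answer_logprob(pos_model, question, answer) - lam * answer_logprob(
        neg_model, question, answer
    )

# Multiple-choice evaluation: pick the candidate with the highest ratio score.
question = "What is the capital of Australia?"
choices = ["Canberra", "Sydney", "Melbourne"]   # "Sydney" is a near-miss
print(max(choices, key=lambda c: likra_score(question, c)))
```

Because the negative head is trained to assign high likelihood to near-misses (e.g., "Sydney" above), subtracting its log-probability specifically pushes down plausible-but-wrong candidates, which is insight (3) of the abstract in miniature.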