Teaching with Lies: Curriculum DPO on Synthetic Negatives for Hallucination Detection

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of highly deceptive negative samples and poor alignment in large language model (LLM) hallucination detection, this paper proposes HaluCheck. First, it constructs high-fidelity synthetic hallucinated texts to serve as high-quality negative samples. Second, it quantifies hallucination difficulty by the reduction in probability scores across multiple independent fact-checking models and uses a curriculum schedule over these scores to guide Direct Preference Optimization (DPO) toward progressive alignment. Third, it dynamically adjusts training difficulty via curriculum learning to enhance training stability and generalization. Evaluated on challenging benchmarks, including MedHallu and HaluEval, HaluCheck achieves up to 24% relative improvement over prior methods. Notably, its zero-shot performance surpasses that of significantly larger state-of-the-art models. This work is the first to jointly enable (i) quantifiable hallucination difficulty, (ii) curriculum-based DPO alignment, and (iii) high-quality synthetic negative sample optimization.
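The difficulty quantification described above can be sketched as follows. The idea is that replacing a truthful answer with a hallucinated one lowers the truthfulness probability that independent fact-checkers assign; a large drop means the hallucination is easy to spot, a small drop means it is deceptive and therefore hard. This is a minimal illustration under that assumption; the function name and scores are hypothetical, not the paper's code.

```python
# Hypothetical sketch: score hallucination difficulty as the mean drop in
# truthfulness probability across independent fact-checking models.
# A larger drop indicates an easier (less deceptive) hallucination.

def probability_drop(truthful_probs, hallucinated_probs):
    """Mean reduction in fact-checker probability when the truthful text
    is replaced by its hallucinated counterpart."""
    drops = [t - h for t, h in zip(truthful_probs, hallucinated_probs)]
    return sum(drops) / len(drops)

# Illustrative scores from three fact-checkers (made-up numbers):
easy = probability_drop([0.95, 0.90, 0.92], [0.10, 0.15, 0.05])  # large drop
hard = probability_drop([0.95, 0.90, 0.92], [0.80, 0.85, 0.75])  # small drop
```

Samples with larger drops would be scheduled earlier in the curriculum, and the most deceptive (smallest-drop) hallucinations last.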

📝 Abstract
Aligning large language models (LLMs) to accurately detect hallucinations remains a significant challenge due to the sophisticated nature of hallucinated text. Recognizing that hallucinated samples typically exhibit higher deceptive quality than traditional negative samples, we use these carefully engineered hallucinations as negative examples in the DPO alignment procedure. Our method incorporates a curriculum learning strategy, gradually transitioning the training from easier samples, identified by the greatest reduction in probability scores from independent fact-checking models, to progressively harder ones. This structured difficulty scaling ensures stable and incremental learning. Experimental evaluation demonstrates that our HaluCheck models, trained with the curriculum DPO approach and high-quality negative samples, significantly improve performance across various metrics, achieving gains of up to 24% on difficult benchmarks like MedHallu and HaluEval. Additionally, HaluCheck models demonstrate robustness in zero-shot settings, significantly outperforming larger state-of-the-art models across various benchmarks.
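The curriculum strategy in the abstract amounts to ordering DPO preference pairs from easiest to hardest before training. A minimal sketch, assuming each pair carries the fact-checker probability drop described above; the tuple layout and example data are illustrative, not the paper's implementation.

```python
# Hypothetical sketch of curriculum ordering for DPO preference pairs.
# Each pair is (prompt, chosen_truthful, rejected_hallucinated, prob_drop),
# where prob_drop is the fact-checker probability reduction for the
# hallucinated text. Larger drop = easier sample = scheduled earlier.

def curriculum_order(pairs):
    """Sort preference pairs easiest-first (largest probability drop first)."""
    return sorted(pairs, key=lambda p: p[3], reverse=True)

pairs = [
    ("q1", "true1", "halu1", 0.10),  # subtle hallucination -> hard
    ("q2", "true2", "halu2", 0.85),  # obvious hallucination -> easy
    ("q3", "true3", "halu3", 0.40),
]
ordered = curriculum_order(pairs)
```

Training then consumes `ordered` sequentially (or in staged buckets), so early DPO updates see only easy negatives and the most deceptive hallucinations arrive once the model is partially aligned.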
Problem

Research questions and friction points this paper is trying to address.

Accurately detecting hallucinations in large language models
Using high-fidelity synthetic negatives to improve DPO alignment
Applying curriculum learning for stable, incremental improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses engineered hallucinations as DPO negative samples
Implements curriculum learning with difficulty scaling
Achieves robustness in zero-shot hallucination detection
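For context on how the engineered hallucinations enter training: DPO optimizes a per-pair loss that rewards the policy for preferring the truthful answer over the hallucinated one, relative to a frozen reference model. A sketch of the standard DPO objective (Rafailov et al.) on scalar log-probabilities; the numbers are made up for illustration and this is not the paper's training code.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard per-example DPO loss:
    -log sigmoid(beta * [(logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)]).
    chosen = truthful text, rejected = synthetic hallucination."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy prefers the truthful answer more strongly than the
# reference does, the margin is positive and the loss falls below log(2).
loss = dpo_loss(logp_chosen=-1.0, logp_rejected=-5.0,
                ref_chosen=-2.0, ref_rejected=-3.0)
```

Feeding highly deceptive synthetic hallucinations as the rejected side makes this margin harder to achieve, which is why the curriculum over difficulty matters for stable training.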