🤖 AI Summary
Large language models (LLMs) suffer from factual hallucinations in long-form question answering, and existing mitigation approaches rely either on GPT-4-based supervision or on external knowledge bases, limiting generalizability and accessibility. To address this, we propose a fully self-supervised preference-optimization framework that requires no external supervision. Our key innovation is a signal-construction mechanism based on atomic-fact consistency: for each question, we sample multiple responses, extract fine-grained atomic facts from each, and automatically construct high-quality preference pairs by cross-comparing factual consistency across samples. The method integrates multi-sample consistency modeling, self-supervised data filtering, and fact-level response evaluation, optimized via a DPO variant. On the LongFact and BioGen benchmarks, our approach outperforms the supervised baseline FactAlign by 1.95 points, significantly improving factual accuracy and deployment feasibility for long-form QA.
📝 Abstract
Large Language Models (LLMs) frequently produce factoid hallucinations: plausible yet incorrect answers. A common mitigation strategy is model alignment, which improves factual accuracy by training on curated factual and non-factual pairs. However, this approach often relies on a stronger model (e.g., GPT-4) or an external knowledge base to assess factual correctness, which may not always be accessible. To address this, we propose Atomic Consistency Preference Optimization (ACPO), a self-supervised preference-tuning method that enhances factual accuracy without external supervision. ACPO leverages atomic consistency signals, i.e., the agreement of individual facts across multiple stochastic responses, to identify high- and low-quality data pairs for model alignment. By eliminating the need for costly GPT calls, ACPO offers a scalable and efficient approach to improving factoid question answering. Despite being self-supervised, empirical results show that ACPO outperforms FactAlign, a strong supervised alignment baseline, by 1.95 points on the LongFact and BioGen datasets, highlighting its effectiveness in improving factual reliability without relying on external models or knowledge bases.
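The preference-pair construction described above (sample several responses, extract atomic facts, score each response by how well its facts agree with the other samples, then pair the most- and least-consistent responses) can be sketched as follows. This is a hedged illustration, not the paper's implementation: `extract_facts` is a placeholder that naively splits on sentences, and fact agreement is checked by exact string match, whereas the actual method would use a proper atomic-fact extractor and semantic matching.

```python
def extract_facts(response: str) -> set[str]:
    # Placeholder for the atomic-fact extractor assumed by the method;
    # here each sentence is naively treated as one "fact".
    return {s.strip() for s in response.split(".") if s.strip()}

def consistency_score(facts: set[str], other_fact_sets: list[set[str]]) -> float:
    # Fraction of this response's facts that also appear in at least
    # one of the other sampled responses (exact match as a stand-in
    # for semantic fact matching).
    if not facts:
        return 0.0
    support = sum(any(f in other for other in other_fact_sets) for f in facts)
    return support / len(facts)

def build_preference_pair(responses: list[str]) -> tuple[str, str]:
    # Score every sampled response by atomic consistency, then return
    # (chosen, rejected) = (most consistent, least consistent) for DPO.
    fact_sets = [extract_facts(r) for r in responses]
    scores = [
        consistency_score(fs, fact_sets[:i] + fact_sets[i + 1:])
        for i, fs in enumerate(fact_sets)
    ]
    chosen = responses[max(range(len(responses)), key=scores.__getitem__)]
    rejected = responses[min(range(len(responses)), key=scores.__getitem__)]
    return chosen, rejected
```

For example, with three sampled answers where two agree on every fact and one contradicts them, the agreeing answer is chosen and the contradicting one is rejected, yielding a self-supervised pair with no external judge.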