🤖 AI Summary
Large language models (LLMs) suffer from factual hallucinations in long-form question answering, and existing mitigation approaches rely either on GPT-4-based supervision or on external knowledge bases, limiting generalizability and accessibility. To address this, we propose a fully self-supervised preference-optimization framework that requires no external supervision. Our key innovation is a signal-construction mechanism based on atomic-fact consistency: for each question, we sample multiple responses, extract fine-grained atomic facts from each, and automatically construct high-quality preference pairs by cross-comparing factual consistency across samples. The method integrates multi-sample consistency modeling, self-supervised data filtering, and fact-level response evaluation, optimized via a DPO variant. On the LongFact and BioGen benchmarks, our approach outperforms the supervised baseline FactAlign by 1.95 points, significantly improving factual accuracy and deployment feasibility for long-form QA.
📝 Abstract
Large Language Models (LLMs) frequently produce factoid hallucinations: plausible yet incorrect answers. A common mitigation strategy is model alignment, which improves factual accuracy by training on curated factual and non-factual pairs. However, this approach often relies on a stronger model (e.g., GPT-4) or an external knowledge base to assess factual correctness, which may not always be accessible. To address this, we propose Atomic Consistency Preference Optimization (ACPO), a self-supervised preference-tuning method that enhances factual accuracy without external supervision. ACPO leverages atomic consistency signals, i.e., the agreement of individual facts across multiple stochastic responses, to identify high- and low-quality data pairs for model alignment. By eliminating the need for costly GPT calls, ACPO offers a scalable and efficient approach to improving factoid question answering. Despite being self-supervised, empirical results show that ACPO outperforms FactAlign, a strong supervised alignment baseline, by 1.95 points on the LongFact and BioGen datasets, highlighting its effectiveness in improving factual reliability without relying on external models or knowledge bases.
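The preference-pair construction described above (sample several responses, extract atomic facts, score each response by how well its facts agree with the other samples, then pair the most- and least-consistent responses) can be sketched as follows. This is a hedged illustration, not the paper's implementation: `extract_facts` is a placeholder that naively splits on sentences, and fact agreement is checked by exact string match, whereas the actual method would use a proper atomic-fact extractor and semantic matching.

```python
def extract_facts(response: str) -> set[str]:
    # Placeholder for the atomic-fact extractor assumed by the method;
    # here each sentence is naively treated as one "fact".
    return {s.strip() for s in response.split(".") if s.strip()}

def consistency_score(facts: set[str], other_fact_sets: list[set[str]]) -> float:
    # Fraction of this response's facts that also appear in at least
    # one of the other sampled responses (exact match as a stand-in
    # for semantic fact matching).
    if not facts:
        return 0.0
    support = sum(any(f in other for other in other_fact_sets) for f in facts)
    return support / len(facts)

def build_preference_pair(responses: list[str]) -> tuple[str, str]:
    # Score every sampled response by atomic consistency, then return
    # (chosen, rejected) = (most consistent, least consistent) for DPO.
    fact_sets = [extract_facts(r) for r in responses]
    scores = [
        consistency_score(fs, fact_sets[:i] + fact_sets[i + 1:])
        for i, fs in enumerate(fact_sets)
    ]
    chosen = responses[max(range(len(responses)), key=scores.__getitem__)]
    rejected = responses[min(range(len(responses)), key=scores.__getitem__)]
    return chosen, rejected
```

For example, with three sampled answers where two agree on every fact and one contradicts them, the agreeing answer is chosen and the contradicting one is rejected, yielding a self-supervised pair with no external judge.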