Only Say What You Know: Calibration-Aware Generation for Long-Form Factuality

📅 2026-05-03
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the challenge of hallucination in large reasoning models during long-form text generation, where errors compound over multi-step inference and existing approaches lack fine-grained factual control. The authors propose a decoupled exploration-and-commitment paradigm implemented through an end-to-end Calibration-Aware Generation (CAG) framework. CAG dynamically evaluates the reliability of intermediate reasoning steps during generation and preferentially incorporates high-confidence content. This approach enables, for the first time, fine-grained factual calibration within the generation process itself, significantly enhancing the model's self-awareness and output reliability. Experimental results across five long-text factuality benchmarks demonstrate up to a 13% improvement in factual accuracy and a reduction in decoding time of up to 37%.
📝 Abstract
Large Reasoning Models achieve strong performance on complex tasks but remain prone to hallucinations, particularly in long-form generation where errors compound across reasoning steps. Existing approaches to improving factuality, including abstention and factuality-driven optimization, follow a coupled exploration-commitment paradigm, in which intermediate reasoning is unconditionally propagated to the final output, limiting fine-grained control over information selection and integration. In this paper, we propose an Exploration-Commitment Decoupling paradigm that disentangles knowledge exploration from final commitment, enabling models to explore with awareness while answering cautiously. We instantiate the paradigm with Calibration-Aware Generation (CAG), a framework that equips models with end-to-end, calibration-aware generation capabilities, by augmenting intermediate reasoning with calibrated reliability estimates and prioritizing reliable content in final outputs. Across five long-form factuality benchmarks and multiple model families, CAG improves factuality by up to 13%, while reducing decoding time by up to 37%. Overall, our work highlights decoupling as a principled approach for more reliable long-form generation, offering directions for trustworthy and self-aware generative systems.
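As a rough illustration of the decoupling idea, the sketch below separates an exploration stage, which yields candidate claims tagged with reliability estimates, from a commitment stage, which keeps only the claims that clear a threshold. This is not the authors' implementation: the `Claim` type, the `commit` function, the 0.7 threshold, and the example scores are all hypothetical, and a plain threshold filter stands in for whatever selection mechanism CAG actually uses.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    confidence: float  # hypothetical calibrated reliability estimate in [0, 1]


def commit(claims: list[Claim], threshold: float = 0.7) -> list[str]:
    """Commitment stage: keep explored claims whose confidence meets the threshold.

    Kept claims are returned most-reliable first; ties keep exploration order
    because Python's sort is stable.
    """
    kept = [c for c in claims if c.confidence >= threshold]
    kept.sort(key=lambda c: c.confidence, reverse=True)
    return [c.text for c in kept]


# Exploration stage (assumed): the model has produced candidate claims,
# each annotated with a reliability estimate. Scores here are made up.
explored = [
    Claim("claim A", 0.95),
    Claim("claim B", 0.40),
    Claim("claim C", 0.80),
]

committed = commit(explored)
print(committed)  # ['claim A', 'claim C']
```

In a coupled setup, all three explored claims would flow into the final answer; here the low-confidence "claim B" is explored but never committed.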
Problem

Research questions and friction points this paper is trying to address.

hallucination
long-form generation
factuality
reasoning models
calibration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exploration-Commitment Decoupling
Calibration-Aware Generation
Long-Form Factuality
Hallucination Mitigation
Reliability Estimation