🤖 AI Summary
This work addresses the reliability challenges of large language models in long-form text generation, where factual errors often undermine output trustworthiness. Traditional "all-or-nothing" abstention strategies frequently cause excessive information loss. To mitigate this, the authors propose a Selective Abstraction framework that formally frames the problem as a trade-off between risk and coverage. They introduce an atom-wise instantiation that decomposes model outputs into atomic claims, estimates each claim's risk of factual inaccuracy, and, rather than deleting uncertain claims outright, replaces them with less specific, higher-confidence abstractions that preserve semantic integrity; coverage is measured with an information-theoretic metric of retained information. Experiments show that this approach improves the area under the risk-coverage curve (AURC) by up to 27.73% over claim-deletion baselines on the FactScore and LongFact-Objects benchmarks, effectively balancing factual accuracy with information retention.
📝 Abstract
LLMs are widely used, yet they remain prone to factual errors that erode user trust and limit adoption in high-risk settings. One approach to mitigating this risk is to equip models with uncertainty estimation mechanisms that abstain when confidence is low. However, this binary "all-or-nothing" approach is excessively restrictive in long-form settings, often discarding valuable information. We introduce Selective Abstraction (SA), a framework that enables LLMs to trade specificity for reliability by selectively reducing the detail of uncertain content. We first formalize SA through the lenses of selective risk and coverage. We then propose Atom-wise Selective Abstraction, a claim-level instantiation that decomposes responses into atomic claims (short, self-contained statements, each expressing a single fact) and replaces uncertain atoms with higher-confidence, less specific abstractions. To evaluate this framework, we develop a novel end-to-end pipeline for open-ended generation that instantiates risk as factual correctness and measures coverage using an information-theoretic measure of retained information. Across six open-source models on the FactScore and LongFact-Objects benchmarks, atom-wise SA consistently outperforms existing baselines, improving the area under the risk-coverage curve (AURC) by up to 27.73% over claim removal and demonstrating that reducing specificity can boost accuracy and reliability while preserving most of a response's original meaning.
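The selective risk/coverage framing above can be sketched in a few lines. This is an illustrative simplification, not the paper's pipeline: the `Atom` structure, the confidence-threshold sweep, and the use of the retained-atom fraction as coverage are all assumptions made for the sketch (the paper measures coverage information-theoretically and replaces low-confidence atoms with abstractions rather than simply excluding them from the count).

```python
from dataclasses import dataclass

@dataclass
class Atom:
    text: str          # a short, self-contained factual claim
    confidence: float  # model's estimated probability the claim is correct
    correct: bool      # ground-truth factuality from an external evaluator

def risk_coverage_curve(atoms, thresholds):
    """Sweep a confidence threshold: atoms at or above it are kept at
    full specificity. Returns (coverage, risk) points, where coverage
    is the fraction of atoms kept and risk is their empirical error
    rate. (In the paper, below-threshold atoms are abstracted, not
    deleted; here they simply drop out of the covered set.)"""
    points = []
    for t in thresholds:
        kept = [a for a in atoms if a.confidence >= t]
        coverage = len(kept) / len(atoms)
        risk = (sum(not a.correct for a in kept) / len(kept)) if kept else 0.0
        points.append((coverage, risk))
    return sorted(points)

def aurc(points):
    """Area under the risk-coverage curve via the trapezoid rule.
    Lower is better: low risk sustained out to high coverage."""
    area = 0.0
    for (c0, r0), (c1, r1) in zip(points, points[1:]):
        area += (c1 - c0) * (r0 + r1) / 2
    return area

# Toy example with hypothetical atoms and confidences.
atoms = [
    Atom("Born in 1967", 0.95, True),
    Atom("Won the award in 2003", 0.40, False),
    Atom("Directed three films", 0.80, True),
    Atom("Lives in Paris", 0.55, True),
]
pts = risk_coverage_curve(atoms, [0.0, 0.5, 0.9, 1.0])
print(aurc(pts))  # → 0.03125
```

A method that abstracts uncertain atoms instead of removing them shifts this curve downward at high coverage, which is what the reported AURC improvement over claim removal captures.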