AI Summary
To address the intrinsic hallucination problem in large language models (LLMs), i.e., generating plausible yet unsupported content in long-form generation, this paper introduces the Precise Information Control (PIC) task: models must generate extended text *exclusively* from a set of verifiable, self-contained short statements, prohibiting any information not entailed by those statements. The contributions are threefold: (1) the first formalization of PIC under both full and partial statement-coverage settings; (2) the construction of PIC-Bench, the first benchmark of its kind, spanning eight diverse downstream tasks; and (3) a post-training framework built on weakly supervised preference data that combines statement grounding, claim-level F1 evaluation, and instruction tuning. Experiments show that the resulting PIC-LM reaches 91.0% F1 (up 21.9 points) in the full PIC setting, improves exact match recall on ambiguous question answering by 17.1%, and boosts factual precision on a birthplace verification task by 30.5%.
Abstract
A central challenge in modern language models (LMs) is intrinsic hallucination: the generation of information that is plausible but unsubstantiated relative to input context. To study this problem, we propose Precise Information Control (PIC), a new task formulation that requires models to generate long-form outputs grounded in a provided set of short self-contained statements, known as verifiable claims, without adding any unsupported ones. For comprehensiveness, PIC includes a full setting that tests a model's ability to include exactly all input claims, and a partial setting that requires the model to selectively incorporate only relevant claims. We present PIC-Bench, a benchmark of eight long-form generation tasks (e.g., summarization, biography generation) adapted to the PIC setting, where LMs are supplied with well-formed, verifiable input claims. Our evaluation of a range of open and proprietary LMs on PIC-Bench reveals that, surprisingly, state-of-the-art LMs still intrinsically hallucinate in over 70% of outputs. To alleviate this lack of faithfulness, we introduce a post-training framework, using a weakly supervised preference data construction method, to train an 8B PIC-LM with stronger PIC ability, improving from 69.1% to 91.0% F1 in the full PIC setting. When integrated into end-to-end factual generation pipelines, PIC-LM improves exact match recall by 17.1% on ambiguous QA with retrieval, and factual precision by 30.5% on a birthplace verification task, underscoring the potential of precisely grounded generation.
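To make the full-setting F1 metric concrete, the sketch below computes claim-level precision (fraction of output claims supported by the input set) and recall (fraction of input claims covered by the output). The `supports` function is a placeholder assumption: the paper's actual verifier is not specified here, and real systems would use an NLI model or LLM judge rather than exact string matching.

```python
# Hypothetical sketch of claim-level F1 for the full PIC setting.
# Assumption: `supports` stands in for an entailment judge; exact string
# matching is used here purely for illustration.

def supports(claim_set, claim):
    # Placeholder entailment check: exact membership in the claim set.
    return claim in claim_set

def pic_f1(input_claims, output_claims):
    """F1 over claims: precision rewards avoiding unsupported output
    claims; recall (full setting) rewards covering every input claim."""
    if not input_claims or not output_claims:
        return 0.0
    precision = sum(supports(input_claims, c) for c in output_claims) / len(output_claims)
    recall = sum(supports(output_claims, c) for c in input_claims) / len(input_claims)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

inp = {"Paris is the capital of France.", "France is in Europe."}
out = {"Paris is the capital of France.", "France is in Europe."}
print(pic_f1(inp, out))  # → 1.0
```

An output that drops one of two input claims, for example, keeps precision at 1.0 but halves recall, yielding F1 ≈ 0.667; an output that adds an unsupported claim is penalized on precision instead.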