GenProve: Learning to Generate Text with Fine-Grained Provenance

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models frequently suffer from hallucinations, and existing citation mechanisms are typically coarse-grained, making it difficult to verify fine-grained support relationships between generated content and source evidence, particularly for reasoning-based statements. This work introduces the novel task of fine-grained provenance generation during decoding, requiring models to output sentence-level structured triples that distinguish among cited, compressed, and inferred evidence. To support this task, we construct ReFInE, an expert-annotated dataset that reveals a significant gap in current models' ability to trace reasoning steps. We propose a training approach combining supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO), using a composite reward to jointly optimize answer faithfulness and provenance accuracy. Our model, GenProve, substantially outperforms 14 strong baselines in joint evaluation, markedly improving the verifiability of generated content.
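The summary does not spell out the triple schema, so the following is a minimal Python sketch of what a sentence-level provenance triple could look like; the class and field names (`ProvenanceTriple`, `answer_sentence`, `evidence_spans`) are illustrative assumptions, not the paper's actual format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

class Relation(Enum):
    """How a generated sentence relates to its source evidence."""
    QUOTATION = "quotation"      # sentence cites evidence near-verbatim
    COMPRESSION = "compression"  # sentence condenses or paraphrases evidence
    INFERENCE = "inference"      # sentence is reasoned from multiple pieces of evidence

@dataclass
class ProvenanceTriple:
    answer_sentence: str        # one sentence of the generated answer
    relation: Relation          # quotation, compression, or inference
    evidence_spans: List[str]   # supporting spans from the source documents

# Hypothetical example: an inference-type triple for a reasoning-based statement.
triple = ProvenanceTriple(
    answer_sentence="Therefore, the policy took effect before the merger.",
    relation=Relation.INFERENCE,
    evidence_spans=[
        "The policy was enacted in March 2019.",
        "The merger closed in June 2020.",
    ],
)
```

Inference-type triples like the one above are exactly where the paper reports models struggling: the answer sentence appears in no single evidence span and must be justified by combining several.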

📝 Abstract
Large language models (LLMs) often hallucinate, and while adding citations is a common solution, it is frequently insufficient for accountability because users struggle to verify how a cited source supports a generated claim. Existing methods are typically coarse-grained and fail to distinguish between direct quotes and complex reasoning. In this paper, we introduce Generation-time Fine-grained Provenance, a task where models must generate fluent answers while simultaneously producing structured, sentence-level provenance triples. To enable this, we present ReFInE (Relation-aware Fine-grained Interpretability & Evidence), a dataset featuring expert-verified annotations that distinguish between Quotation, Compression, and Inference. Building on ReFInE, we propose GenProve, a framework that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). By optimizing a composite reward for answer fidelity and provenance correctness, GenProve significantly outperforms 14 strong LLMs in joint evaluation. Crucially, our analysis uncovers a reasoning gap: models excel at surface-level quotation but struggle with inference-based provenance, suggesting that verifiable reasoning remains a frontier challenge distinct from surface-level citation.
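As a rough illustration of the training signal, here is a minimal sketch of a composite reward feeding GRPO's group-relative advantages. The weighted-sum form, the weight `alpha`, and the scorer names are assumptions for illustration; the abstract does not give the paper's exact reward definition. The group normalization itself is the standard GRPO formulation.

```python
import numpy as np

def composite_reward(answer_fidelity: float, provenance_correctness: float,
                     alpha: float = 0.5) -> float:
    """Blend answer and provenance scores; alpha is a hypothetical mixing weight."""
    return alpha * answer_fidelity + (1.0 - alpha) * provenance_correctness

def grpo_advantages(rewards: list) -> np.ndarray:
    """Standard GRPO: normalize rewards within a group of completions sampled
    for the same prompt, so each advantage is relative to its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four sampled completions for one prompt, scored on both axes.
rewards = [composite_reward(a, p) for a, p in
           [(0.9, 0.8), (0.7, 0.4), (0.5, 0.9), (0.2, 0.3)]]
print(grpo_advantages(rewards))  # jointly strong completions get positive advantage
```

The point of the composite form is that a completion must score well on both axes to earn a positive group-relative advantage, which is how the training jointly optimizes answer fidelity and provenance correctness rather than trading one off against the other.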
Problem

Research questions and friction points this paper is trying to address.

hallucination
fine-grained provenance
verifiable reasoning
citation
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained Provenance
GenProve
ReFInE
Verifiable Reasoning
Group Relative Policy Optimization
Jingxuan Wei
University of Chinese Academy of Sciences
Natural Language Processing · Multimodal Learning
Xingyue Wang
Shenyang Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Yanghaoyu Liao
Shenyang Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Jie Dong
Shenyang Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Yuchen Liu
Caijun Jia
Shenyang Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Bihui Yu
Shenyang Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Junnan Zhu
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing