🤖 AI Summary
Large language models frequently suffer from hallucinations, and existing citation mechanisms are typically coarse-grained, making it difficult to verify fine-grained support relationships between generated content and source evidence, particularly for reasoning-based statements. This work introduces the novel task of fine-grained provenance generation during decoding, requiring models to output sentence-level structured triples that distinguish among cited, compressed, and inferred evidence. To support this task, we construct ReFInE, an expert-annotated dataset; evaluation on it reveals a significant gap in current models' ability to trace reasoning steps. We propose a training approach combining supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO), using a composite reward to jointly optimize answer faithfulness and provenance accuracy. Our model, GenProve, substantially outperforms 14 strong baselines in joint evaluation, markedly enhancing the verifiability of generated content.
📝 Abstract
Large language models (LLMs) often hallucinate, and while adding citations is a common solution, it is frequently insufficient for accountability: users struggle to verify how a cited source supports a generated claim. Existing methods are typically coarse-grained and fail to distinguish between direct quotes and complex reasoning. In this paper, we introduce Generation-time Fine-grained Provenance, a task where models must generate fluent answers while simultaneously producing structured, sentence-level provenance triples. To enable this, we present ReFInE (Relation-aware Fine-grained Interpretability & Evidence), a dataset featuring expert-verified annotations that distinguish between Quotation, Compression, and Inference. Building on ReFInE, we propose GenProve, a framework that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). By optimizing a composite reward for answer fidelity and provenance correctness, GenProve significantly outperforms 14 strong LLMs in joint evaluation. Crucially, our analysis uncovers a reasoning gap: models excel at surface-level quotation but struggle significantly with inference-based provenance, suggesting that verifiable reasoning remains a frontier challenge distinct from surface-level citation.
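To make the task concrete, here is a minimal sketch of what a sentence-level provenance triple might look like. The field names and `Relation` labels below are illustrative assumptions for exposition; the paper's actual schema may differ.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    # Hypothetical labels mirroring the three annotation types in ReFInE.
    QUOTATION = "quotation"      # answer sentence copies the evidence verbatim
    COMPRESSION = "compression"  # answer sentence summarizes/paraphrases the evidence
    INFERENCE = "inference"      # answer sentence is reasoned from the evidence


@dataclass
class ProvenanceTriple:
    answer_sentence: str  # a sentence in the generated answer
    evidence_span: str    # the supporting span in the source document
    relation: Relation    # how the evidence supports the sentence


# Illustrative example: the claim follows from the evidence by arithmetic,
# so it is an inference rather than a quotation or compression.
triple = ProvenanceTriple(
    answer_sentence="The company's revenue doubled in 2021.",
    evidence_span="Revenue rose from $2M in 2020 to $4M in 2021.",
    relation=Relation.INFERENCE,
)
```

A model performing generation-time provenance would emit one such triple per answer sentence alongside the fluent text, which is what makes verification fine-grained.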