🤖 AI Summary
To address the limited verifiability of large language model (LLM) outputs, this paper proposes Quote-Tuning—a paradigm that aligns models to quote statements verbatim from trusted pre-training corpora, trivializing human verification. Methodologically: (1) the authors design an efficient membership inference function that quickly verifies text against trusted corpora; (2) they use it to build a reward that quantifies verbatim quotes in model responses and to curate datasets for preference learning; (3) they integrate these components into a quoting-aware preference-based alignment pipeline. Experiments demonstrate that Quote-Tuning increases verbatim quoting from high-quality documents by up to 130% relative to base models without degrading response quality, generalizes to out-of-domain data and diverse model families, and additionally improves truthfulness—embedding verifiability as an intrinsic capability of LLMs.
📝 Abstract
To trust the fluent generations of large language models (LLMs), humans must be able to verify their correctness against trusted, external sources. Recent efforts, such as providing citations via retrieved documents or post-hoc provenance, enhance verifiability but provide no guarantees on their correctness. To address these limitations, we tackle the verifiability goal with a different philosophy: trivializing the verification process by developing models that quote verbatim statements from trusted sources in their pre-training data. We propose Quote-Tuning, which demonstrates the feasibility of aligning models to quote. The core of Quote-Tuning is a fast membership inference function that efficiently verifies text against trusted corpora. We leverage this tool to design a reward function that quantifies quotes in model responses, and curate datasets for preference learning. Experiments show that Quote-Tuning significantly increases verbatim quotes from high-quality documents by up to 130% relative to base models while maintaining response quality. Quote-Tuning is applicable across different tasks, generalizes to out-of-domain data and diverse model families, and provides additional benefits to truthfulness. Our method not only offers a hassle-free way to increase quoting but also opens up avenues for improving LLM trustworthiness through better verifiability.
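To make the core idea concrete, here is a minimal toy sketch of the two ingredients the abstract describes: a membership test that verifies text against a trusted corpus, and a reward that quantifies verbatim quoting in a response. This is not the paper's implementation; the n-gram indexing scheme, the `QUOTE_LEN` threshold, and the coverage-based reward definition are all illustrative assumptions.

```python
# Toy sketch of quote verification and a quoting reward.
# Assumptions (not from the paper): whitespace tokenization, a fixed
# minimum quote length QUOTE_LEN, and token-coverage as the reward.

QUOTE_LEN = 5  # assumed minimum quote length, in tokens


def build_ngram_index(corpus_docs, n=QUOTE_LEN):
    """Index every n-token window of the trusted corpus for O(1) membership lookup."""
    index = set()
    for doc in corpus_docs:
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            index.add(tuple(tokens[i:i + n]))
    return index


def quote_coverage(response, index, n=QUOTE_LEN):
    """Fraction of response tokens covered by some corpus-matching n-gram.

    Acts as a toy quoting reward: higher means more of the response
    is quoted verbatim from the trusted corpus.
    """
    tokens = response.split()
    if len(tokens) < n:
        return 0.0
    covered = [False] * len(tokens)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in index:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(tokens)
```

In a preference-learning setup like the one the abstract describes, a score of this kind could rank candidate responses (higher coverage preferred) to build the quoting preference dataset; a real system would need a much faster membership structure (e.g., a suffix-array-style index) to scale to pre-training corpora.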