Nearest Neighbor Speculative Decoding for LLM Generation and Attribution

📅 2024-05-29
🏛️ Neural Information Processing Systems
📈 Citations: 10
Influential: 0
🤖 AI Summary
To address hallucination, missing content provenance, and slow inference in large language models (LLMs), this paper proposes NEST, a token-level retrieval-augmented semi-parametric speculative decoding framework. The method combines k-nearest-neighbor (kNN) retrieval, a hybrid (mixture) output distribution, approximate speculative decoding, and a dynamic prefix-acceptance strategy, enabling real-time injection of authentic text spans of arbitrary length together with precise source attribution. Compared to the conventional kNN-LM, the framework significantly improves generation fluency and attribution, achieving a 32% increase in attribution rate on knowledge-intensive tasks while attaining a 1.8× inference speedup over Llama-2-Chat 70B. It outperforms standard kNN-LM, performs competitively with in-context retrieval augmentation, and jointly delivers high-quality generation, interpretable provenance tracing, and efficient inference.

📝 Abstract
Large language models (LLMs) often hallucinate and lack the ability to provide attribution for their generations. Semi-parametric LMs, such as kNN-LM, approach these limitations by refining the output of an LM for a given prompt using its nearest neighbor matches in a non-parametric data store. However, these models often exhibit slow inference speeds and produce non-fluent texts. In this paper, we introduce Nearest Neighbor Speculative Decoding (NEST), a novel semi-parametric language modeling approach that is capable of incorporating real-world text spans of arbitrary length into the LM generations and providing attribution to their sources. NEST performs token-level retrieval at each inference step to compute a semi-parametric mixture distribution and identify promising span continuations in a corpus. It then uses an approximate speculative decoding procedure that accepts a prefix of the retrieved span or generates a new token. NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks, surpassing the conventional kNN-LM method and performing competitively with in-context retrieval augmentation. In addition, NEST substantially improves the generation speed, achieving a 1.8x speedup in inference time when applied to Llama-2-Chat 70B. Code will be released at https://github.com/facebookresearch/NEST/tree/main.
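The semi-parametric mixture distribution described above follows the kNN-LM recipe that NEST builds on: the LM's next-token distribution is interpolated with a distribution formed from retrieved datastore neighbors. The sketch below illustrates that interpolation; the function name, the distance-to-weight softmax, and the mixing coefficient `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def knn_mixture(p_lm, neighbor_dists, neighbor_tokens, vocab_size,
                lam=0.3, temp=1.0):
    """Blend an LM's next-token distribution with a kNN distribution
    built from retrieved neighbors (kNN-LM-style interpolation).
    All parameter names and defaults here are illustrative."""
    # Turn neighbor distances into weights: closer neighbors count more.
    weights = np.exp(-np.asarray(neighbor_dists, dtype=float) / temp)
    weights /= weights.sum()
    # Scatter neighbor weights onto their corresponding vocabulary tokens.
    p_knn = np.zeros(vocab_size)
    for tok, w in zip(neighbor_tokens, weights):
        p_knn[tok] += w
    # Semi-parametric mixture of retrieval and LM distributions.
    return lam * p_knn + (1.0 - lam) * np.asarray(p_lm, dtype=float)

# Toy 3-token vocabulary: two retrieved neighbors shift mass toward token 1.
p_lm = np.array([0.7, 0.2, 0.1])
p_mix = knn_mixture(p_lm, neighbor_dists=[0.1, 0.5],
                    neighbor_tokens=[1, 2], vocab_size=3)
```

NEST additionally uses the retrieved neighbors to identify promising span continuations in the corpus, rather than only reweighting single tokens.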
Problem

Research questions and friction points this paper is trying to address.

Reducing hallucinations in LLM generations
Improving attribution for generated text sources
Enhancing inference speed and text fluency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-level retrieval for semi-parametric mixture distribution
Approximate speculative decoding for span acceptance
1.8× inference speedup on Llama-2-Chat 70B
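The span-acceptance idea in the second point can be sketched like standard speculative decoding: a retrieved span acts as a draft, each draft token is accepted with probability min(1, p_target/p_draft), and at the first rejection the accepted prefix is kept while the LM generates a fresh token. This is a simplified illustration under those assumptions; NEST's actual procedure is an approximate, relaxed variant of this rule.

```python
import random

def accept_span_prefix(span, p_target, p_draft, rng=random.Random(0)):
    """Speculative-decoding-style prefix acceptance over a retrieved span.
    span:      candidate continuation tokens from the corpus.
    p_target:  p_target(tok, i) -> target (mixture) prob of tok at position i.
    p_draft:   p_draft(tok, i)  -> prob under the retrieval draft proposal.
    Returns the accepted prefix; the caller generates a new token at the
    first rejection. Simplified sketch, not NEST's exact acceptance rule."""
    accepted = []
    for i, tok in enumerate(span):
        ratio = p_target(tok, i) / max(p_draft(tok, i), 1e-12)
        if rng.random() < min(1.0, ratio):
            accepted.append(tok)   # keep the draft token
        else:
            break                  # reject: stop, resample from the LM here
    return accepted

# Toy example: target and draft agree everywhere, so the whole span is kept.
prefix = accept_span_prefix([5, 8, 13],
                            p_target=lambda t, i: 1.0,
                            p_draft=lambda t, i: 1.0)
# → [5, 8, 13]
```

Accepting multi-token prefixes in one step, instead of emitting one token per forward pass, is what yields the reported inference speedup.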