🤖 AI Summary
Large language models (LLMs) frequently exhibit hallucination and misalignment with human expectations in quotation generation. To address this, we propose the first fully automated five-dimensional quotation evaluation framework, assessing factual accuracy, relevance, coverage, formatting compliance, and semantic coherence. We construct a high-quality bilingual quotation knowledge base comprising 32,022 entries, spanning multiple disciplines and languages. Furthermore, we design a quotation-specific semantic-factual joint re-ranking metric that integrates retrieval-augmented generation (RAG), semantic similarity modeling, and factuality verification. Experimental results demonstrate strong correlation between our evaluation scores and human preferences (Spearman's ρ > 0.85), substantial improvements in LLM quotation accuracy and completeness, and a marked reduction in the quality gap between LLM- and human-generated quotations. The code and dataset are publicly released.
📝 Abstract
While large language models (LLMs) have become excellent writing assistants, they still struggle with quotation generation: they either hallucinate when providing factual quotations or fail to provide quotes that exceed human expectations. To bridge this gap, we systematically study how to evaluate and improve LLMs' performance on quotation generation. We first establish a holistic, automatic evaluation system for the quotation generation task, consisting of five criteria, each with a corresponding automatic metric. To improve LLMs' quotation generation abilities, we construct a bilingual knowledge base that is broad in scope and rich in dimensions, containing 32,022 quotes. Moreover, guided by our criteria, we design a quotation-specific metric to rerank the quotations retrieved from the knowledge base. Extensive experiments show that our metrics strongly correlate with human preferences. Existing LLMs struggle to generate desired quotes, but our quotation knowledge base and reranking metric help narrow this gap. Our dataset and code are publicly available at https://github.com/GraceXiaoo/QUILL.