🤖 AI Summary
Large language models (LLMs) frequently exhibit hallucination and misalignment with human expectations in quotation generation. To address this, we propose the first fully automated five-dimensional quotation evaluation framework, assessing factual accuracy, relevance, coverage, formatting compliance, and semantic coherence. We construct a high-quality bilingual quotation knowledge base comprising 32,022 entries, spanning multiple disciplines and languages. Furthermore, we design a quotation-specific semantic-factual joint re-ranking metric that integrates retrieval-augmented generation (RAG), semantic similarity modeling, and factuality verification. Experimental results demonstrate strong correlation between our evaluation scores and human preferences (Spearman's ρ > 0.85), substantial improvements in LLM quotation accuracy and completeness, and a marked reduction in the quality gap between LLM- and human-generated quotations. The code and dataset are publicly released.
📝 Abstract
While large language models (LLMs) have become excellent writing assistants, they still struggle with quotation generation: they either hallucinate when providing factual quotations or fail to provide quotes that exceed human expectations. To bridge this gap, we systematically study how to evaluate and improve LLMs' performance on quotation generation. We first establish a holistic, automatic evaluation system for the quotation generation task, consisting of five criteria, each with a corresponding automatic metric. To improve LLMs' quotation generation abilities, we construct a bilingual knowledge base that is broad in scope and rich in dimensions, containing 32,022 quotes. Moreover, guided by our criteria, we design a quotation-specific metric to rerank the quotations retrieved from the knowledge base. Extensive experiments show that our metrics strongly correlate with human preferences. Existing LLMs struggle to generate desired quotes, but our quotation knowledge base and reranking metric help narrow this gap. Our dataset and code are publicly available at https://github.com/GraceXiaoo/QUILL.