🤖 AI Summary
In high-stakes domains (e.g., healthcare, law, finance), large language models (LLMs) must provide human-verifiable citations to ensure factual reliability and accountability. This work systematically compares two citation paradigms—generation-time citation (G-Cite) and post-hoc citation (P-Cite)—identifying retrieval quality as the primary determinant of attribution accuracy. Within a unified framework spanning zero-shot to retrieval-augmented settings, we conduct multi-scenario experiments across four major attribution benchmarks, complemented by human evaluation and automated analysis. Results show that P-Cite achieves superior trade-offs between citation coverage and correctness, making it better suited for high-risk applications requiring broad, reliable grounding. In contrast, G-Cite attains higher precision but suffers from lower coverage and higher latency, rendering it appropriate for stringent fact-checking tasks where verifiability outweighs breadth. To our knowledge, this is the first empirically grounded, application-aware guideline for selecting citation strategies in LLM-based systems.
📝 Abstract
Trustworthy Large Language Models (LLMs) must cite human-verifiable sources in high-stakes domains such as healthcare, law, academia, and finance, where even small errors can have severe consequences. Practitioners and researchers face a choice: let models generate citations during decoding, or draft answers first and then attach appropriate citations. To clarify this choice, we introduce two paradigms: Generation-Time Citation (G-Cite), which produces the answer and citations in one pass, and Post-hoc Citation (P-Cite), which adds or verifies citations after drafting. We conduct a comprehensive evaluation from zero-shot to advanced retrieval-augmented methods across four popular attribution datasets and provide evidence-based recommendations that weigh trade-offs across use cases. Our results show a consistent trade-off between coverage and citation correctness, with retrieval as the main driver of attribution quality in both paradigms. P-Cite methods achieve high coverage with competitive correctness and moderate latency, whereas G-Cite methods prioritize precision at the cost of coverage and speed. We recommend a retrieval-centric, P-Cite-first approach for high-stakes applications, reserving G-Cite for precision-critical settings such as strict claim verification. Our code and human evaluation results are available at https://anonymous.4open.science/r/Citation_Paradigms-BBB5/
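The two paradigms can be contrasted in pipeline terms: G-Cite retrieves evidence and then generates the answer with inline citations in a single pass, while P-Cite drafts an answer first and attaches citations afterward. A minimal sketch, with a toy word-overlap retriever and stubbed generation standing in for an LLM (all helper names here are hypothetical, not the paper's implementation):

```python
# Hypothetical sketch of the two citation paradigms; the retriever and
# "generation" steps are toy stand-ins for a real retriever and LLM.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Toy retriever: rank passages by word overlap with the query.
    q_words = set(query.lower().split())
    return sorted(corpus,
                  key=lambda p: len(q_words & set(p.lower().split())),
                  reverse=True)[:k]

def g_cite(query: str, corpus: list[str]) -> tuple[str, list[str]]:
    # G-Cite: retrieve first, then produce answer and citations together
    # in one generation pass (simulated here by string formatting).
    passages = retrieve(query, corpus)
    markers = " ".join(f"[{i}]" for i in range(1, len(passages) + 1))
    return f"Answer to: {query} {markers}", passages

def p_cite(query: str, corpus: list[str]) -> tuple[str, list[str]]:
    # P-Cite: draft the answer first, then attach citations post hoc
    # by retrieving evidence for the drafted text.
    draft = f"Answer to: {query}"
    passages = retrieve(draft, corpus)
    markers = " ".join(f"[{i}]" for i in range(1, len(passages) + 1))
    return f"{draft} {markers}", passages

corpus = [
    "Aspirin reduces the risk of heart attack in some patients.",
    "The capital of France is Paris.",
    "Statins lower cholesterol levels.",
]
answer, evidence = p_cite("Does aspirin reduce heart attack risk", corpus)
print(answer)       # drafted answer with post-hoc citation markers
print(evidence[0])  # top-ranked supporting passage
```

The sketch makes the structural difference concrete: in G-Cite, citation quality is bound to the single decoding pass, while P-Cite can re-query the retriever for each drafted claim, which is why retrieval quality dominates attribution accuracy in both paradigms.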