🤖 AI Summary
To address the challenge of jointly optimizing coherent text generation and precise citation retrieval in academic writing, this paper proposes a retrieval-token-driven dynamic RAG framework. During autoregressive decoding, dynamically inserted [RET] tokens trigger targeted literature retrieval, and the retrieved documents are seamlessly integrated into the generation process, enabling end-to-end joint optimization of writing and citation. Key contributions include: (1) the first learnable retrieval-token mechanism; (2) a lightweight architecture supporting multi-task joint fine-tuning; and (3) domain-specific pretraining and adaptation on arXiv academic corpora. Experiments show substantial improvements: the method achieves 40.1% top-1 retrieval accuracy, outperforming E5-Mistral and BM25; attains an academic writing quality score of 16.2/25, surpassing Qwen-2.5-72B; and human evaluation confirms simultaneous gains in citation recall and writing efficiency.
📝 Abstract
Academic writing requires both coherent text generation and precise citation of relevant literature. Although recent Retrieval-Augmented Generation (RAG) systems have significantly improved factual accuracy in general-purpose text generation, their capacity to adequately support professional academic writing remains limited. In this work, we introduce ScholarCopilot, a unified framework designed to enhance existing large language models for generating professional academic articles with accurate and contextually relevant citations. ScholarCopilot dynamically determines when to retrieve scholarly references by generating a retrieval token [RET], and then utilizes its representation to look up relevant citations from a database. The retrieved references are fed into the model to augment the generation process. We jointly optimize both the generation and citation tasks within a single framework to increase efficiency. Trained on 500K papers from arXiv, our model achieves a top-1 retrieval accuracy of 40.1% on our evaluation dataset, outperforming baselines such as E5-Mistral-7B-Instruct (15.0%) and BM25 (9.8%). On a dataset of 1,000 academic writing samples, ScholarCopilot scores 16.2/25 in generation quality (measured across relevance, coherence, academic rigor, completeness, and innovation), surpassing models with 10x more parameters such as Qwen-2.5-72B-Instruct (15.8/25). Human studies also confirm ScholarCopilot's superior performance in citation recall, writing efficiency, and overall user experience, validating the effectiveness of our approach.
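The [RET]-triggered retrieval loop described above can be sketched as follows. This is a minimal toy illustration under stated assumptions, not the actual ScholarCopilot implementation: the bag-of-words `embed` function and the `paper-1`/`paper-2` citation labels are hypothetical stand-ins, and the real system uses the model's hidden state at the [RET] position as the retrieval query rather than re-embedding the preceding context.

```python
import numpy as np

RET = "[RET]"  # special retrieval token emitted by the generator


def embed(text: str, dim: int = 8) -> np.ndarray:
    """Toy deterministic embedding: bucket each word by its character sum.

    Stand-in for the learned dense encoder; only used to make the
    retrieval loop runnable.
    """
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[sum(ord(c) for c in word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def retrieve(query_vec: np.ndarray, db: list[tuple[str, np.ndarray]]) -> str:
    """Return the citation label with the highest cosine similarity.

    Vectors in `db` are pre-normalized, so the dot product is the
    cosine similarity.
    """
    scores = [float(query_vec @ doc_vec) for _, doc_vec in db]
    return db[int(np.argmax(scores))][0]


def decode_with_citations(tokens: list[str], db: list[tuple[str, np.ndarray]]) -> str:
    """Walk a token stream; whenever [RET] appears, look up a citation
    from `db` and splice it into the output in place of the token."""
    output: list[str] = []
    for tok in tokens:
        if tok == RET:
            query = embed(" ".join(output))  # proxy for the [RET] hidden state
            output.append(f"({retrieve(query, db)})")
        else:
            output.append(tok)
    return " ".join(output)
```

A usage example: with a two-entry database, a draft sentence about retrieval-augmented generation pulls in the matching reference at the [RET] position, while an unrelated entry is ignored. In the full system, the retrieved reference text would also be fed back into the model's context to condition subsequent generation.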