Citegeist: Automated Generation of Related Work Analysis on the arXiv Corpus

📅 2025-03-29

📈 Citations: 0

✨ Influential: 0

career value

171K/year

🤖 AI Summary

To address time-consuming literature review, incomplete coverage, and inaccurate citation in academic writing, this paper proposes a dynamic Retrieval-Augmented Generation (RAG) system tailored for arXiv. Methodologically, it introduces the first streaming-updatable dynamic RAG architecture; designs a multi-stage semantic filtering and summarization co-refinement mechanism to mitigate LLM hallucination; and implements multi-granularity summarization, incremental indexing, and plug-and-play support for multiple LLM backends. Experiments demonstrate 92.4% citation accuracy and 98% coverage of arXiv papers published within the past five years in real-world scenarios. The system is publicly available as an open-source web platform (citegeist.org) and a lightweight API toolkit, establishing a robust, scalable RAG infrastructure for scholarly writing.

Technology Category

Application Category

📝 Abstract

Large Language Models provide significant new opportunities for the generation of high-quality written works. However, their employment in the research community is inhibited by their tendency to hallucinate invalid sources and lack of direct access to a knowledge base of relevant scientific articles. In this work, we present Citegeist: An application pipeline using dynamic Retrieval Augmented Generation (RAG) on the arXiv Corpus to generate a related work section and other citation-backed outputs. For this purpose, we employ a mixture of embedding-based similarity matching, summarization, and multi-stage filtering. To adapt to the continuous growth of the document base, we also present an optimized way of incorporating new and modified papers. To enable easy utilization in the scientific community, we release both, a website (https://citegeist.org), as well as an implementation harness that works with several different LLM implementations.

Problem

Research questions and friction points this paper is trying to address.

Generates citation-backed related work sections using arXiv Corpus

Addresses LLM hallucination by employing dynamic RAG pipeline

Optimizes integration of new papers for continuous corpus growth

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic RAG on arXiv Corpus

Embedding-based similarity and summarization

Optimized for continuous document growth

🔎 Similar Papers

No similar papers found.