RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates

📅 2026-04-29

🤖 AI Summary
This work addresses two challenges of maintaining documentation in large codebases: tool-generated content lacks semantic structure, and changes are hard to track across dependencies. It proposes RepoDoc, a system that uses a repository knowledge graph (RepoKG) as the semantic foundation for the entire documentation lifecycle. RepoDoc follows a three-stage pipeline: code entity and relation extraction, functional module clustering, and agent-driven documentation generation. Hierarchical module organization and a bidirectional semantic impact propagation mechanism yield structured, cross-referenced documentation and allow efficient incremental updates. Evaluated on 24 repositories across 8 programming languages, RepoDoc improves API coverage by 32.5% and completeness by 10.4%, while generating documentation 3× faster with 85% fewer tokens. For incremental updates, it cuts update time by 73%, lowers token usage by 77%, and increases update recall by 10.2%.
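To make the first stage (code entity and relation extraction) concrete, here is a minimal sketch in Python. It is not RepoDoc's implementation, and the summary does not say how extraction is done. The sketch assumes a single Python source string, treats top-level functions as entities and direct calls between them as "calls" relations, and uses only the standard `ast` module. The names `build_call_graph` and `SAMPLE` are hypothetical.

```python
import ast


def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the top-level functions it calls.

    Illustrative assumption: entities are top-level functions and the only
    relation is a direct call by plain name (no methods, no imports).
    """
    tree = ast.parse(source)
    # Entities: names of functions defined at module level.
    defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    edges: dict[str, set[str]] = {name: set() for name in defined}
    for node in tree.body:
        if not isinstance(node, ast.FunctionDef):
            continue
        # Relations: every call inside the function body to a known entity.
        for inner in ast.walk(node):
            if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                if inner.func.id in defined:
                    edges[node.name].add(inner.func.id)
    return edges


# Hypothetical three-function module used only for this example.
SAMPLE = '''
def load(path):
    return open(path).read()

def parse(text):
    return text.split()

def run(path):
    return parse(load(path))
'''

graph = build_call_graph(SAMPLE)
for caller in sorted(graph):
    print(caller, "->", sorted(graph[caller]))
```

A real repository graph would cover more entity types (classes, modules, files) and more relation types (imports, inheritance) across several languages. The sketch only shows the general shape of such a graph: nodes for entities and directed edges for relations.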
📝 Abstract
Maintaining up-to-date, comprehensive documentation for large codebases is a persistent challenge. Recent progress in automated documentation has moved from template-based rules to large language models (LLMs), yet existing tools still process source code as flat fragments, producing isolated documents that lack semantic structure. This design also leads to excessive token consumption and slow generation, while failing to capture how code changes propagate across dependencies. We propose RepoDoc, a system that uses a repository knowledge graph (RepoKG) as the semantic foundation for the entire documentation lifecycle. Our framework consists of three stages: (1) RepoKG construction, which extracts code entities and their relationships; (2) module clustering, which groups code into functionally cohesive, hierarchical units; and (3) skillful agent-based generation, which queries the graph to create modular, cross-referenced documentation with auto-generated Mermaid diagrams. For incremental maintenance, a semantic impact propagation mechanism navigates the RepoKG bidirectionally to pinpoint all affected parts, allowing selective, targeted regeneration. Evaluated on 24 repositories across 8 programming languages, RepoDoc substantially outperforms state-of-the-art alternatives. It improves API coverage by 32.5% and completeness by 10.4%, while generating documentation 3x faster with 85% fewer tokens. For incremental updates, it cuts update time by 73% and token usage by 77%, and achieves 10.2% higher update recall, more accurately reflecting code changes in the regenerated documentation. The source code and experimental artifacts are available at https://github.com/SYSUSELab/RepoDoc.
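The bidirectional impact propagation described for incremental maintenance can be sketched as a graph traversal. This is a simplified illustration in Python, not the paper's algorithm. It assumes the graph is a plain dictionary of "depends on" edges, walks both outgoing edges (dependencies) and incoming edges (dependents) from the changed entities, and stops at a hop limit. The hop limit `max_hops`, the function name `impacted_entities`, and the sample graph `DEPENDS_ON` are all assumptions made for this example.

```python
from collections import deque


def impacted_entities(edges, changed, max_hops=None):
    """Return entities reachable from `changed` along edges in either direction.

    edges: dict mapping an entity to the set of entities it depends on.
    changed: iterable of entities whose code was modified.
    max_hops: optional traversal depth limit (an assumption of this sketch).
    """
    # Build the reverse adjacency so dependents can be reached as well.
    reverse = {}
    for src, dsts in edges.items():
        for dst in dsts:
            reverse.setdefault(dst, set()).add(src)

    seen = set(changed)
    queue = deque((entity, 0) for entity in seen)
    while queue:
        node, hops = queue.popleft()
        if max_hops is not None and hops >= max_hops:
            continue
        # Bidirectional step: dependencies and dependents of the node.
        neighbours = edges.get(node, set()) | reverse.get(node, set())
        for nxt in neighbours:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return seen


# Hypothetical dependency graph: cli -> report -> run -> {parse, load}.
DEPENDS_ON = {
    "cli": {"report"},
    "report": {"run"},
    "run": {"parse", "load"},
    "parse": set(),
    "load": set(),
}

print(sorted(impacted_entities(DEPENDS_ON, {"parse"}, max_hops=1)))
# ['parse', 'run']
```

Without a hop limit, this traversal returns the whole connected component, which would defeat selective regeneration. A practical mechanism therefore needs some pruning criterion, and the hop limit here only stands in for whatever criterion the system actually applies. Only the documentation attached to the returned entities would then be regenerated.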
Problem

Research questions and friction points this paper is trying to address.

automatic documentation generation
incremental updates
knowledge graph
code documentation
semantic structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

repository knowledge graph
incremental documentation update
semantic impact propagation
agent-based documentation generation
code dependency modeling