HCAG: Hierarchical Abstraction and Retrieval-Augmented Generation on Theoretical Repositories with LLMs

📅 2026-03-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge that existing retrieval-augmented generation methods struggle to model high-level architecture and cross-file dependencies in theory-driven codebases, resulting in a semantic gap between theoretical specifications and implementations. To bridge this gap, we propose HCAG, a novel framework that integrates hierarchical abstraction with architecture-guided generation. HCAG constructs a multi-granularity semantic knowledge base offline, performs hierarchical retrieval online, and prioritizes architectural coherence during modular code generation. It further incorporates adaptive node compression to optimize computational cost and employs a multi-agent collaborative discussion mechanism to enhance planning capabilities. We also introduce the first large-scale dataset explicitly aligned between theoretical constructs and their implementations. Evaluated on algorithmic game theory system generation, HCAG substantially outperforms current approaches, achieving significant improvements in code quality, architectural consistency, and requirement fulfillment, while effectively boosting the performance of domain-specific large language models.

📝 Abstract

Existing Retrieval-Augmented Generation (RAG) methods for code struggle to capture the high-level architectural patterns and cross-file dependencies inherent in complex, theory-driven codebases, such as those in algorithmic game theory (AGT). The result is a persistent semantic and structural gap between abstract concepts and executable implementations. To address this challenge, we propose Hierarchical Code/Architecture-guided Agent Generation (HCAG), a framework that reformulates repository-level code generation as a structured, planning-oriented process over hierarchical knowledge. HCAG adopts a two-phase design: an offline hierarchical abstraction phase that recursively parses code repositories and aligned theoretical texts to construct a multi-resolution semantic knowledge base explicitly linking theory, architecture, and implementation; and an online hierarchical retrieval and scaffolded generation phase that performs top-down, level-wise retrieval to guide LLMs in an architecture-then-module generation paradigm. To further improve robustness and consistency, HCAG integrates a multi-agent discussion mechanism inspired by cooperative games. We provide a theoretical analysis showing that hierarchical abstraction with adaptive node compression achieves cost-optimality compared to flat and iterative RAG baselines. Extensive experiments on diverse game-theoretic system generation tasks demonstrate that HCAG substantially outperforms representative repository-level methods in code quality, architectural coherence, and requirement pass rate. In addition, HCAG produces a large-scale, aligned theory-implementation dataset that effectively enhances domain-specific LLMs through post-training. Although demonstrated in AGT, the HCAG paradigm also offers a general blueprint for mining, reusing, and generating complex systems from structured codebases in other domains.
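To make the top-down, level-wise retrieval idea concrete, the following is a minimal sketch and not the paper's implementation. The `Node` class, the word-overlap `score` function, the `beam` parameter, and the toy repository are all hypothetical names chosen for illustration; a real system would presumably use learned embeddings and LLM-written summaries rather than word overlap.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One entry in a hypothetical multi-resolution knowledge base."""
    name: str
    summary: str  # abstraction of this level (repo, module, or file)
    children: list["Node"] = field(default_factory=list)


def score(query: str, text: str) -> float:
    """Toy relevance: Jaccard overlap of word sets (stand-in for embeddings)."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def retrieve_top_down(root: Node, query: str, beam: int = 1) -> list[list[Node]]:
    """Return the nodes selected at each level, from coarse to fine.

    At every level only the children of the previously selected nodes are
    scored, so whole subtrees are pruned without being inspected.
    """
    levels = []
    frontier = [root]
    while frontier:
        levels.append(frontier)
        candidates = [child for node in frontier for child in node.children]
        candidates.sort(key=lambda n: score(query, n.summary), reverse=True)
        frontier = candidates[:beam]
    return levels


# Hypothetical three-level knowledge base: repository -> module -> file.
repo = Node("repo", "algorithmic game theory library", [
    Node("auctions", "auction mechanisms and payment rules", [
        Node("vickrey.py", "second price sealed bid auction payment"),
        Node("english.py", "ascending open outcry auction"),
    ]),
    Node("equilibria", "nash equilibrium solvers for matrix games", [
        Node("lemke_howson.py", "lemke howson pivoting for bimatrix games"),
    ]),
])

path = retrieve_top_down(repo, "second price auction payment")
print([[n.name for n in level] for level in path])
# [['repo'], ['auctions'], ['vickrey.py']]
```

In this sketch the coarse-to-fine path (repository, then module, then file) is what would be handed to a generator so that architecture-level context is fixed before module-level code is written; the `equilibria` subtree is never scored below its top-level summary.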

Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation

code generation

architectural patterns

cross-file dependencies

theory-implementation gap

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Abstraction

Retrieval-Augmented Generation

Architecture-Guided Code Generation