A Scalable Benchmark for Repository-Oriented Long-Horizon Conversational Context Management

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of long-context dialogue over code repositories, where large language models often lose critical information due to excessive context length, and existing context management methods lack targeted evaluation benchmarks. To bridge this gap, we introduce LoCoEval, the first benchmark specifically designed for this scenario, which leverages an LLM-driven pipeline to generate realistic and diverse dialogue-repository interaction data. We further propose a unified memory mechanism that integrates dialogue history with repository structure. Comprehensive evaluation of seven baseline methods on LoCoEval reveals their limitations, while our approach—operating without oracle information—significantly outperforms all baselines and demonstrates superior robustness and effectiveness across multiple metrics.

📝 Abstract
In recent years, large language models (LLMs) have advanced rapidly, substantially enhancing their code understanding and generation capabilities and giving rise to powerful code assistants. However, in practical repository development, excessively long-horizon conversational context may overwhelm models, causing the loss of critical information and degraded performance, thereby limiting the utility of code assistants. Existing context management methods proposed to mitigate this context dilemma primarily target general-purpose conversations, while repository-oriented solutions remain largely unexplored, largely due to the lack of reliable evaluation benchmarks. To bridge this gap, we present LoCoEval, the first long-horizon conversational context management benchmark tailored to repository-oriented development scenarios. Adhering to three key principles, LoCoEval is constructed via an LLM-driven pipeline that generates realistic and diverse repository-oriented conversations, capturing key interaction patterns such as iterative requirements, noisy input, and retrospective questions. We evaluate 7 baselines, including 4 representative context management methods, using 3 advanced backbone LLMs on LoCoEval. The results reveal substantial challenges faced by standalone LLMs and existing approaches, especially memory systems, in repository-oriented conversational scenarios. To address these limitations, we further propose an improved method integrating conversational and repository information into a unified memory, which outperforms all baselines (*Oracle* excluded) and demonstrates robustness. Additionally, we investigate the impact of various factors on method performance, providing actionable insights for future research.
Problem

Research questions and friction points this paper is trying to address.

long-horizon conversation
context management
repository-oriented development
code assistant
evaluation benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-horizon conversation
repository-oriented context management
LLM-driven benchmark
unified memory
code assistant evaluation
Yang Liu
State Key Laboratory of Complex & Critical Software Environment, School of Computer Science and Engineering, Beihang University, China
Li Zhang
Associate Professor, School of Software, Tsinghua University
Big Data, Workflow, Information System
Fang Liu
Beihang University
AI4SE, LLMs, Code Understanding, Code Generation
Ping Lin
Scientist, University of Florida
Computational Chemistry
Xinyi Li
State Key Laboratory of Complex & Critical Software Environment, School of Computer Science and Engineering, Beihang University, China