🤖 AI Summary
This work investigates how well large language models (LLMs) sustain understanding and execution over long-term, multi-session programming collaboration: tracking instructions across sessions, resisting interference from irrelevant context, and retrieving and integrating critical information along extended instruction chains. To this end, the authors introduce MemoryCode, the first synthetic multi-session programming dataset featuring contextually noisy inputs and inter-session dependencies, together with a rigorous evaluation framework grounded in retrieval consistency and execution accuracy. Their systematic analysis reveals a previously uncharacterized bottleneck: severe cross-session memory decay and failed information integration. While state-of-the-art models such as GPT-4o achieve high single-session accuracy, their performance drops by over 40% on collaborative coding tasks spanning three or more sessions, primarily due to forgetting critical instructions and forming spurious context associations. The study establishes foundational benchmarks and identifies concrete directions for developing reliable, long-horizon AI programming assistants.
📝 Abstract
Large Language Models (LLMs) are increasingly used in working environments for a wide range of tasks, excelling at solving individual problems in isolation. However, can they also collaborate effectively over long-term interactions? To investigate this, we introduce MemoryCode, a synthetic multi-session dataset designed to test LLMs' ability to track and execute simple coding instructions amid irrelevant information, simulating a realistic setting. While all the models we tested handle isolated instructions well, even the performance of state-of-the-art models like GPT-4o deteriorates when instructions are spread across sessions. Our analysis suggests this is due to their failure to retrieve and integrate information over long instruction chains. Our results highlight a fundamental limitation of current LLMs, restricting their ability to collaborate effectively in long interactions.
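The setting described above can be sketched with a toy example. The sessions, instructions, and helper names below are illustrative assumptions, not taken from the MemoryCode dataset: an early session states a coding convention, later sessions mix in irrelevant chatter and an updated instruction, and the model is expected to follow the most recent instruction when writing code.

```python
# Hypothetical sketch of a multi-session instruction-tracking task.
# All session contents and function names are illustrative, not from MemoryCode.

sessions = [
    [("filler", "Welcome to the team!"),
     ("instruction", "x_")],            # e.g. "prefix every function name with x_"
    [("filler", "The cafeteria menu changed.")],   # irrelevant distractor session
    [("instruction", "y_")],            # a later session overrides the earlier prefix
]

def active_prefix(sessions):
    """Scan sessions in order; the most recently stated prefix wins."""
    prefix = None
    for session in sessions:
        for kind, content in session:
            if kind == "instruction":
                prefix = content
    return prefix

def follows_instructions(fn_name, sessions):
    """Check whether a generated function name obeys the current instruction."""
    return fn_name.startswith(active_prefix(sessions))

print(follows_instructions("y_parse", sessions))  # expected: True
print(follows_instructions("x_parse", sessions))  # stale instruction: False
```

The hard part for an LLM is not the rule itself but doing this scan implicitly over a long, noisy conversation: the distractor session must be ignored, and the session-1 instruction must be recognized as superseded rather than still active.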