🤖 AI Summary
This study addresses the lack of scalable, reproducible automation for adapting Software Engineering (SE) research artifacts across datasets. We present the first empirical evaluation of multi-agent systems powered by large language models (GPT-4.1/Claude Sonnet 4) on this task. The evaluation integrates prompt engineering, code diff analysis, and error-feedback mechanisms, and rigorously assesses the systems across five evaluation stages on benchmarks including ROCODE and LogHub2.0 to measure adaptation capability in code understanding, modification, and command execution. Key findings show that contextual feedback and error information substantially improve adaptation quality, raising structural similarity from 7.25% to 67.14%. Our principal contribution is empirical evidence for the critical role of self-correction mechanisms in agent performance, which establishes an evidence-based design paradigm for robust, SE-dataset-oriented multi-agent systems.
📝 Abstract
Automating the adaptation of software engineering (SE) research artifacts across datasets is essential for scalability and reproducibility, yet it remains largely unstudied. Recent advances in large language model (LLM)-based multi-agent systems, such as GitHub Copilot's agent mode, promise to automate complex development workflows through coordinated reasoning, code generation, and tool interaction. This paper presents the first empirical study of how state-of-the-art multi-agent systems perform on dataset adaptation tasks. We evaluate Copilot, backed by GPT-4.1 and Claude Sonnet 4, on adapting SE research artifacts from benchmark repositories including ROCODE and LogHub2.0. Through a five-stage evaluation pipeline (file comprehension, code editing, command generation, validation, and final execution), we measure success rates, analyze failure patterns, and assess prompt-based interventions designed to enhance agent performance. Results show that current systems can identify key files and generate partial adaptations but rarely produce functionally correct implementations. Prompt-level interventions, especially providing execution error messages and reference code, substantially improve structural similarity to ground truth (from 7.25% to 67.14%), highlighting the importance of contextual and feedback-driven guidance. Our findings reveal both the promise and the limitations of today's multi-agent LLM systems for dataset adaptation, and suggest concrete directions for building more reliable, self-correcting agents in future SE research.