The Impact of Input Order Bias on Large Language Models for Software Fault Localization

📅 2024-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a critical input-order sensitivity in large language models (LLMs) for software fault localization (FL): reversing the order of code lines reduces Top-1 accuracy from 57% to 20%. We present the first systematic quantification of this ordering bias and propose a robust input representation that combines dependency-graph-based (DepGraph) code ordering with context chunking. DepGraph ordering alone improves Top-1 accuracy to 48%; adding chunking narrows the performance gap between the best and worst input orders (as measured by Kendall Tau distance) from 22% to 1%, a 95% improvement in order robustness. Beyond empirically establishing LLMs' structural dependence on syntactic and semantic code ordering in FL, the approach introduces a generalizable context-organization paradigm that improves LLM reliability and generalization in realistic defect-localization scenarios, offering a principled response to input-sensitivity challenges in program understanding tasks.

📝 Abstract
Large Language Models (LLMs) show great promise in software engineering tasks like Fault Localization (FL) and Automatic Program Repair (APR). This study examines how input order and context size affect LLM performance in FL, a key step for many downstream software engineering tasks. We test different orders for methods using Kendall Tau distances, including "perfect" (where ground truths come first) and "worst" (where ground truths come last). Our results show a strong bias in order, with Top-1 accuracy falling from 57% to 20% when we reverse the code order. Breaking down inputs into smaller contexts helps reduce this bias, narrowing the performance gap between perfect and worst orders from 22% to just 1%. We also look at ordering methods based on traditional FL techniques and metrics. Ordering using DepGraph's ranking achieves 48% Top-1 accuracy, better than more straightforward ordering approaches like CallGraph. These findings underscore the importance of how we structure inputs, manage contexts, and choose ordering methods to improve LLM performance in FL and other software engineering tasks.
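The abstract quantifies ordering with Kendall Tau distance, a standard metric that counts the fraction of item pairs on which two rankings disagree (0.0 for identical orders, 1.0 for a full reversal). The paper's exact implementation is not given here; the function name below is a hypothetical illustration of the standard definition:

```python
from itertools import combinations

def kendall_tau_distance(order_a, order_b):
    """Normalized Kendall Tau distance between two orderings of the
    same items: 0.0 = identical order, 1.0 = fully reversed."""
    pos_b = {item: i for i, item in enumerate(order_b)}
    pairs = list(combinations(order_a, 2))
    # A pair (x, y) is discordant when x precedes y in order_a
    # but follows y in order_b.
    discordant = sum(1 for x, y in pairs if pos_b[x] > pos_b[y])
    return discordant / len(pairs)

# A "worst" order (ground truths last) is the reversal of the
# "perfect" order, giving the maximum distance of 1.0.
print(kendall_tau_distance(["m1", "m2", "m3"], ["m3", "m2", "m1"]))  # 1.0
```

Intermediate distances between these extremes correspond to partially shuffled method orders, which is how the study sweeps between the perfect and worst cases.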
Problem

Research questions and friction points this paper is trying to address.

Information Sequence
Information Volume
Software Fault Localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Software Fault Localization
Information Chunking
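The "Information Chunking" idea splits the candidate methods into smaller contexts so each LLM prompt ranks only a few candidates at a time, which the paper reports narrows the perfect-versus-worst gap to 1%. The paper's exact chunking procedure is not reproduced here; a minimal fixed-size sketch, with a hypothetical function name, could look like:

```python
def chunk_methods(methods, chunk_size):
    """Split a list of candidate methods into fixed-size chunks so
    each LLM prompt sees at most chunk_size candidates."""
    return [methods[i:i + chunk_size]
            for i in range(0, len(methods), chunk_size)]

# Seven candidates with chunk_size=3 yield prompts of 3, 3, and 1 methods.
chunks = chunk_methods(["m1", "m2", "m3", "m4", "m5", "m6", "m7"], 3)
print(len(chunks))  # 3
```

Per-chunk results would then be merged into a single ranking; smaller chunks reduce how much the within-prompt order can matter, at the cost of more LLM calls.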
Md Nakhla Rafi
Software Performance, Analysis, and Reliability (SPEAR) Lab, Concordia University, Montréal, Québec, Canada
Dong Jae Kim
Assistant Professor in Computing, DePaul University, Illinois, USA
Software Testing, Test Quality, Maintainability, Application of Machine Learning, Anomaly Detection
Tse-Hsun Chen
Software Performance, Analysis, and Reliability (SPEAR) Lab, Concordia University, Montréal, Québec, Canada
Shaowei Wang
University of Manitoba, Winnipeg, Canada