Defects4Log: Benchmarking LLMs for Logging Code Defect Detection and Reasoning

📅 2025-08-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Logging code defects severely impair system observability, yet existing studies suffer from narrow defect-pattern coverage, single-source data, and a lack of systematic exploration of large language models (LLMs) for this task. Method: We introduce Defects4Log, a benchmark dataset for logging defect detection comprising 164 developer-validated real-world defects; propose a fine-grained taxonomy covering seven defect patterns across 14 scenarios; and design a knowledge-enhanced LLM evaluation framework integrating domain-specific defect knowledge with multi-strategy prompting. Contribution/Results: Experiments show that LLMs achieve limited accuracy when processing raw code alone; incorporating structured defect knowledge improves detection accuracy by 10.9%, validating the efficacy of knowledge-guided reasoning. This work establishes an empirical foundation for logging quality assurance and trustworthy LLM application in systems software analysis.

📝 Abstract
Logging code is written by developers to capture system runtime behavior and plays a vital role in debugging, performance analysis, and system monitoring. However, defects in logging code can undermine the usefulness of logs and lead to misinterpretations. Although prior work has identified several logging defect patterns and provided valuable insights into logging practices, these studies often focus on a narrow range of defect patterns derived from limited sources (e.g., commit histories) and lack a systematic and comprehensive analysis. Moreover, large language models (LLMs) have demonstrated promising generalization and reasoning capabilities across a variety of code-related tasks, yet their potential for detecting logging code defects remains largely unexplored. In this paper, we derive a comprehensive taxonomy of logging code defects, which encompasses seven logging code defect patterns with 14 detailed scenarios. We further construct a benchmark dataset, Defects4Log, consisting of 164 developer-verified real-world logging defects. Then we propose an automated framework that leverages various prompting strategies and contextual information to evaluate LLMs' capability in detecting and reasoning about logging code defects. Experimental results reveal that LLMs generally struggle to accurately detect and reason about logging code defects based on the source code only. However, incorporating proper knowledge (e.g., detailed scenarios of defect patterns) can lead to a 10.9% improvement in detection accuracy. Overall, our findings provide actionable guidance for practitioners to avoid common defect patterns and establish a foundation for improving LLM-based reasoning in logging code defect detection.
Problem

Research questions and friction points this paper is trying to address.

Identifying logging code defects in software systems
Evaluating LLMs for logging defect detection and reasoning
Improving LLM accuracy with contextual defect pattern knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive taxonomy of logging code defects
Benchmark dataset with real-world logging defects
Automated framework leveraging LLMs and contextual information
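The knowledge-enhanced prompting idea above can be sketched as follows. This is a minimal, hypothetical illustration: the defect-pattern names, descriptions, and prompt wording are assumptions for demonstration, not the paper's actual taxonomy or framework code.

```python
# Hypothetical sketch: prepending structured defect-pattern knowledge to an
# LLM prompt for logging defect detection. Pattern names and wording are
# illustrative only, not taken from the Defects4Log taxonomy.

DEFECT_KNOWLEDGE = {
    "inappropriate-level": (
        "The log level does not match the severity of the event, "
        "e.g., an exception handler logging at DEBUG."
    ),
    "stale-message": (
        "The log message no longer reflects the surrounding code, "
        "e.g., after a refactoring changed the logged operation."
    ),
}


def build_prompt(code: str, with_knowledge: bool = False) -> str:
    """Assemble a detection prompt; optionally include defect knowledge."""
    parts = []
    if with_knowledge:
        parts.append("Known logging defect patterns:")
        for name, desc in DEFECT_KNOWLEDGE.items():
            parts.append(f"- {name}: {desc}")
    parts.append(
        "Does the following snippet contain a logging code defect? "
        "Name the pattern or answer 'none', then explain your reasoning."
    )
    parts.append(code)
    return "\n".join(parts)


snippet = (
    "try:\n"
    "    conn.close()\n"
    "except IOError as e:\n"
    '    logger.debug("closed")\n'
)
print(build_prompt(snippet, with_knowledge=True))
```

The paper's reported 10.9% accuracy gain comes from this kind of contrast: the same model is queried once with the raw code only (`with_knowledge=False`) and once with the taxonomy's scenario descriptions injected into the prompt.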
Xin Wang
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Zhenhao Li
York University, Toronto, Canada
Zishuo Ding
The Hong Kong University of Science and Technology (Guangzhou)
Software Engineering