Defects4Log: Benchmarking LLMs for Logging Code Defect Detection and Reasoning

📅 2025-08-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Logging code defects severely impair system observability, yet existing studies suffer from narrow defect-pattern coverage, single-source data, and a lack of systematic exploration of large language models (LLMs) for this task. Method: We introduce Defects4Log, a benchmark dataset for logging defect detection comprising 164 developer-validated real-world defects; propose a fine-grained taxonomy covering seven defect patterns across 14 scenarios; and design a knowledge-enhanced LLM evaluation framework integrating domain-specific defect knowledge with multi-strategy prompting. Contribution/Results: Experiments show that LLMs achieve limited accuracy when processing raw code alone; incorporating structured defect knowledge improves detection accuracy by 10.9%, validating the efficacy of knowledge-guided reasoning. This work establishes an empirical foundation for logging quality assurance and trustworthy LLM application in systems software analysis.

📝 Abstract
Logging code is written by developers to capture system runtime behavior and plays a vital role in debugging, performance analysis, and system monitoring. However, defects in logging code can undermine the usefulness of logs and lead to misinterpretations. Although prior work has identified several logging defect patterns and provided valuable insights into logging practices, these studies often focus on a narrow range of defect patterns derived from limited sources (e.g., commit histories) and lack a systematic and comprehensive analysis. Moreover, large language models (LLMs) have demonstrated promising generalization and reasoning capabilities across a variety of code-related tasks, yet their potential for detecting logging code defects remains largely unexplored. In this paper, we derive a comprehensive taxonomy of logging code defects, which encompasses seven logging code defect patterns with 14 detailed scenarios. We further construct a benchmark dataset, Defects4Log, consisting of 164 developer-verified real-world logging defects. Then we propose an automated framework that leverages various prompting strategies and contextual information to evaluate LLMs' capability in detecting and reasoning about logging code defects. Experimental results reveal that LLMs generally struggle to accurately detect and reason about logging code defects based on the source code only. However, incorporating proper knowledge (e.g., detailed scenarios of defect patterns) can lead to a 10.9% improvement in detection accuracy. Overall, our findings provide actionable guidance for practitioners to avoid common defect patterns and establish a foundation for improving LLM-based reasoning in logging code defect detection.
Problem

Research questions and friction points this paper is trying to address.

Identifying logging code defects in software systems
Evaluating LLMs for logging defect detection and reasoning
Improving LLM accuracy with contextual defect pattern knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive taxonomy of logging code defects
Benchmark dataset with real-world logging defects
Automated framework leveraging LLMs and contextual information
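The knowledge-enhanced prompting idea above can be sketched as follows. This is a minimal, hypothetical illustration: the defect-pattern names, descriptions, and prompt wording are assumptions for demonstration, not the paper's actual taxonomy or framework code.

```python
# Hypothetical sketch: prepending structured defect-pattern knowledge to an
# LLM prompt for logging defect detection. Pattern names and wording are
# illustrative only, not taken from the Defects4Log taxonomy.

DEFECT_KNOWLEDGE = {
    "inappropriate-level": (
        "The log level does not match the severity of the event, "
        "e.g., an exception handler logging at DEBUG."
    ),
    "stale-message": (
        "The log message no longer reflects the surrounding code, "
        "e.g., after a refactoring changed the logged operation."
    ),
}


def build_prompt(code: str, with_knowledge: bool = False) -> str:
    """Assemble a detection prompt; optionally include defect knowledge."""
    parts = []
    if with_knowledge:
        parts.append("Known logging defect patterns:")
        for name, desc in DEFECT_KNOWLEDGE.items():
            parts.append(f"- {name}: {desc}")
    parts.append(
        "Does the following snippet contain a logging code defect? "
        "Name the pattern or answer 'none', then explain your reasoning."
    )
    parts.append(code)
    return "\n".join(parts)


snippet = (
    "try:\n"
    "    conn.close()\n"
    "except IOError as e:\n"
    '    logger.debug("closed")\n'
)
print(build_prompt(snippet, with_knowledge=True))
```

The paper's reported 10.9% accuracy gain comes from this kind of contrast: the same model is queried once with the raw code only (`with_knowledge=False`) and once with the taxonomy's scenario descriptions injected into the prompt.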
Xin Wang
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Zhenhao Li
York University, Toronto, Canada
Zishuo Ding
The Hong Kong University of Science and Technology (Guangzhou)
Software Engineering