🤖 AI Summary
Insecure logging practices can leak sensitive information or enable log injection attacks, compromising system security and user privacy. This work presents a systematic characterization of logging code security issues, defining four issue categories and ten corresponding patterns. The authors construct a benchmark dataset of 101 real-world logging security issue reports that were manually reviewed and annotated. Using this benchmark, they develop an automated evaluation framework to assess how well large language models (LLMs) detect and repair logging security issues. Experimental results show that average detection accuracy ranges from 12.9% to 52.5% and that LLMs struggle to reliably generate correct repairs. Notably, supplying only the issue description improves detection accuracy more than supplying the security pattern explanation or both together.
📝 Abstract
Logging code plays an important role in software systems by recording key events and behaviors, which are essential for debugging and monitoring. However, insecure logging practices can inadvertently expose sensitive information or enable attacks such as log injection, posing serious threats to system security and privacy. Prior research has examined general defects in logging code, but systematic analysis of logging code security issues remains limited, particularly in leveraging LLMs for detection and repair. In this paper, we derive a comprehensive taxonomy of logging code security issues, encompassing four common issue categories and 10 corresponding patterns. We further construct a benchmark dataset of 101 real-world logging security issue reports that have been manually reviewed and annotated. We then propose an automated framework that incorporates various kinds of contextual knowledge to evaluate LLMs' capabilities in detecting and repairing logging security issues. Our experimental results reveal a notable disparity in performance: while LLMs are moderately effective at detecting security issues (average accuracy ranges from 12.9% to 52.5%), they face noticeable challenges in reliably generating correct code repairs. We also find that the issue description alone improves the LLMs' detection accuracy more than the security pattern explanation alone or a combination of both. Overall, our findings provide actionable insights for practitioners and highlight the potential and limitations of current LLMs for secure logging.
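To make the two threats named in the abstract concrete, the following minimal Python sketch contrasts a logging call that is open to log injection and secret leakage with a somewhat safer variant. It is an illustration only and is not drawn from the paper's taxonomy or benchmark: the helper names (`sanitize_for_log`, `mask_secret`, `build_logger`, `log_login_insecure`, `log_login_safer`) and the sample token are hypothetical, and escaping CR/LF is just one possible mitigation.

```python
import io
import logging


def sanitize_for_log(value: str) -> str:
    """Escape CR and LF so user input cannot start a forged log line."""
    return value.replace("\r", "\\r").replace("\n", "\\n")


def mask_secret(secret: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters of a secret with '*'."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes 'LEVEL message' lines to `stream`."""
    logger = logging.getLogger("login_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def log_login_insecure(logger: logging.Logger, username: str, token: str) -> None:
    # Insecure: raw user input and the full token go straight into the log.
    logger.info("login failed for user=%s token=%s", username, token)


def log_login_safer(logger: logging.Logger, username: str, token: str) -> None:
    # Safer: newlines are escaped and the token is masked before logging.
    logger.info(
        "login failed for user=%s token=%s",
        sanitize_for_log(username),
        mask_secret(token),
    )


def demo() -> tuple[str, str]:
    """Log the same malicious input both ways and return the two outputs."""
    attack = "alice\nINFO login succeeded for user=admin"
    token = "tok_12345678"  # made-up value for illustration

    insecure_out = io.StringIO()
    log_login_insecure(build_logger(insecure_out), attack, token)

    safer_out = io.StringIO()
    log_login_safer(build_logger(safer_out), attack, token)

    return insecure_out.getvalue(), safer_out.getvalue()
```

Calling `demo()` returns both outputs: the insecure variant writes two physical log lines, the second of which looks like a genuine success entry, while the safer variant writes a single line with the newline escaped and only the last four characters of the token visible.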