Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs

📅 2026-04-19

🤖 AI Summary
Existing research on automated log generation is largely confined to a single programming language, which makes it hard to judge whether methods generalize across languages and under real-world code evolution. This work presents MultiLogBench, the first comprehensive, maintenance-oriented benchmark for automated logging, spanning six programming language ecosystems, 63,965 repository-snapshot instances, and 744 revision-history cases in which developers introduce logging statements. Under a unified evaluation protocol, seven large language models are assessed on tasks including logging-site localization, framework-anchor matching, and severity prediction. The experiments show clear performance variation across languages and code structures, with loop and nested-callable sites the hardest contexts and framework-anchor matching the most language-sensitive component. Model rankings are stable only at the top tier, which underscores the need for multilingual evaluation in automated logging research.

📝 Abstract
Logging statements are central to debugging, failure diagnosis, and production observability, yet writing them requires developers to decide where to place a logging statement, which API and severity level to use, and what runtime information to expose. Automated logging aims to reduce this burden, but existing evidence remains dominated by Java-centric repository-snapshot datasets. It is therefore unclear whether conclusions about model behavior and model selection generalize across programming-language ecosystems or realistic code evolution. This paper presents MultiLogBench, a multilingual benchmark and empirical study spanning six programming language ecosystems. MultiLogBench contains 63,965 production-code repository-snapshot instances, 744 revision-history cases where developers introduce logging statements during maintenance, and a paired transformed revision-history branch for robustness analysis. Using seven contemporary large language models under a unified protocol, we evaluate logging-site localization, framework-anchor matching, severity prediction, message generation, variable recovery, and cascaded overall quality. Results show clear cross-language variation: framework-anchor matching is the most language-sensitive component, loop and nested-callable sites are the hardest structural contexts, and model rankings are stable only at the top tier. These patterns persist at a coarse level on revision-history data, while transformed inputs do not cause a broad same-direction performance collapse. Overall, MultiLogBench shows that robust claims about automated logging require multilingual evaluation and maintenance-oriented validation.
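
The abstract does not define how "cascaded overall quality" is computed. One plausible reading, sketched below purely as an assumption, is that each component counts as correct only if every earlier component was also correct. The `LogStatement` fields, the `cascaded_scores` function, and the exact-match criteria are hypothetical and are not taken from MultiLogBench; message generation is left out because it would need a text-similarity measure the abstract does not name.

```python
from dataclasses import dataclass


# Hypothetical record type: field names are illustrative assumptions,
# not the MultiLogBench schema.
@dataclass(frozen=True)
class LogStatement:
    line: int             # logging site (line where the statement is placed)
    anchor: str           # framework anchor, e.g. "logger" or "log"
    level: str            # severity level, e.g. "info" or "error"
    variables: frozenset  # runtime variables exposed by the statement


def cascaded_scores(predicted: LogStatement, reference: LogStatement) -> dict:
    """Score components in order; a stage passes only if all earlier stages passed.

    Assumed cascade order: site -> anchor -> level -> variables.
    """
    site_ok = predicted.line == reference.line
    anchor_ok = site_ok and predicted.anchor == reference.anchor
    level_ok = anchor_ok and predicted.level == reference.level
    vars_ok = level_ok and predicted.variables == reference.variables
    return {
        "site": site_ok,
        "anchor": anchor_ok,
        "level": level_ok,
        "variables": vars_ok,
        "overall": vars_ok,  # overall quality requires every stage to pass
    }


reference = LogStatement(12, "logger", "error", frozenset({"err"}))
wrong_anchor = LogStatement(12, "log", "error", frozenset({"err"}))
scores = cascaded_scores(wrong_anchor, reference)
```

Under this reading, a prediction that finds the right site but uses the wrong framework anchor fails every later stage, which is one way an anchor mismatch could dominate overall quality in languages with many competing logging frameworks.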
Problem

Research questions and friction points this paper is trying to address.

automated logging
multilingual benchmark
code evolution
programming languages
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual benchmark
automated logging
large language models
code evolution
framework-anchor matching