Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing research on automated logging is largely confined to a single programming language, which makes it hard to judge whether methods generalize across languages and under real-world code evolution. This work presents MultiLogBench, a maintenance-oriented multilingual benchmark for automated logging that spans six programming language ecosystems, 63,965 repository-snapshot instances, and 744 revision-history cases in which developers introduce logging statements. Under a unified evaluation protocol, the authors assess seven large language models on logging-site localization, framework-anchor matching, severity prediction, message generation, and variable recovery. The experiments reveal clear performance variation across languages and code structures: framework-anchor matching is the most language-sensitive component, and loop and nested-callable sites are the hardest structural contexts. Model rankings are stable only at the top tier, which underscores the need for multilingual evaluation in automated logging research.

📝 Abstract

Logging statements are central to debugging, failure diagnosis, and production observability, yet writing them requires developers to decide where to place a logging statement, which API and severity level to use, and what runtime information to expose. Automated logging aims to reduce this burden, but existing evidence remains dominated by Java-centric repository-snapshot datasets. It is therefore unclear whether conclusions about model behavior and model selection generalize across programming-language ecosystems or to realistic code evolution. This paper presents MultiLogBench, a multilingual benchmark and empirical study spanning six programming language ecosystems. MultiLogBench contains 63,965 production-code repository-snapshot instances, 744 revision-history cases where developers introduce logging statements during maintenance, and a paired transformed revision-history branch for robustness analysis. Using seven contemporary large language models under a unified protocol, we evaluate logging-site localization, framework-anchor matching, severity prediction, message generation, variable recovery, and cascaded overall quality. Results show clear cross-language variation: framework-anchor matching is the most language-sensitive component, loop and nested-callable sites are the hardest structural contexts, and model rankings are stable only at the top tier. These patterns persist at a coarse level on revision-history data, while transformed inputs do not cause a broad same-direction performance collapse. Overall, MultiLogBench shows that robust claims about automated logging require multilingual evaluation and maintenance-oriented validation.

Problem

Research questions and friction points this paper is trying to address.

automated logging

multilingual benchmark

code evolution

programming languages

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual benchmark

automated logging

large language models