🤖 AI Summary
Software logging faces a dual challenge: excessive logging incurs high operational costs, while insufficient logging introduces significant debugging and monitoring risks. Existing tools lack principled support for the fundamental “whether-to-log” decision and fail to model the multi-stage, compositional nature of log generation. This paper proposes AutoLogger—a novel hybrid multi-agent framework that comprehensively covers the full logging decision chain: *judgment* (whether to log), *localization* (where to insert), and *generation* (what to log). AutoLogger integrates static program analysis, retrieval-augmented reasoning, fine-tuned binary classifiers, and LLM-based evaluation mechanisms. A dedicated judgment model accurately identifies logging necessity; a localization agent and a generation agent jointly perform context-aware log injection. Evaluated on three open-source projects, AutoLogger achieves 96.63% F1-score in logging necessity classification and improves end-to-end log quality—assessed via LLM-as-a-judge—by 16.13% over the strongest baseline, with compatibility across diverse LLM backbones.
📝 Abstract
Software logging is critical for system observability, yet developers face a dual crisis of costly overlogging and risky underlogging. Existing automated logging tools often overlook the fundamental whether-to-log decision and struggle with the composite nature of logging. In this paper, we propose AutoLogger, a novel hybrid framework that addresses the complete end-to-end logging pipeline. AutoLogger first employs a fine-tuned classifier, the Judger, to accurately determine whether a method requires new logging statements. If logging is needed, a multi-agent system is activated. The system includes specialized agents: a Locator dedicated to determining where to log, and a Generator focused on what to log. These agents work together, using our purpose-built program analysis and retrieval tools. We evaluate AutoLogger on a large corpus from three mature open-source projects against state-of-the-art baselines. Our results show that AutoLogger achieves a 96.63% F1-score on the crucial whether-to-log decision. In an end-to-end setting, AutoLogger improves the overall quality of generated logging statements by 16.13% over the strongest baseline, as measured by an LLM-as-a-judge score. We also demonstrate that our framework is generalizable, consistently boosting the performance of various backbone LLMs.
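The three-stage decision chain the abstract describes (judgment → localization → generation) can be sketched as a gated pipeline. The sketch below is purely illustrative: all class and function names are hypothetical, and the Judger, Locator, and Generator bodies are trivial stand-ins (a keyword heuristic instead of the paper's fine-tuned classifier and LLM agents), shown only to make the control flow concrete.

```python
# Hypothetical sketch of a judgment -> localization -> generation pipeline.
# None of these names come from the paper; the stage bodies are placeholders.
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoggingSuggestion:
    line_no: int     # where to insert the statement (Locator's output)
    statement: str   # the logging statement text (Generator's output)


def judger(method_source: str) -> bool:
    """Stand-in for the fine-tuned whether-to-log classifier.

    Here: a trivial heuristic that flags methods containing exception
    handling, purely so the sketch is runnable."""
    return "except" in method_source or "raise" in method_source


def locator(method_source: str) -> int:
    """Stand-in for the Locator agent: pick a line to log at.

    Here: the first line inside an `except` block."""
    for i, line in enumerate(method_source.splitlines()):
        if line.strip().startswith("except"):
            return i + 1
    return 0


def generator(method_source: str, line_no: int) -> str:
    """Stand-in for the Generator agent: produce the statement text."""
    return 'logger.error("operation failed: %s", exc)'


def autologger_pipeline(method_source: str) -> Optional[LoggingSuggestion]:
    # Stage 1: judgment -- methods the Judger rejects get no logging at all,
    # which is how the framework avoids costly overlogging.
    if not judger(method_source):
        return None
    # Stage 2: localization (where to log).
    line_no = locator(method_source)
    # Stage 3: generation (what to log), conditioned on the chosen location.
    return LoggingSuggestion(line_no, generator(method_source, line_no))


method = """\
def read_config(path):
    try:
        return open(path).read()
    except OSError as exc:
        raise
"""
suggestion = autologger_pipeline(method)
```

The key design point the sketch captures is that localization and generation only run after the binary judgment passes, so the expensive agentic stages never execute for methods that need no new logging.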