AI Summary
Comprehending large, multifunctional codebases is slow and inefficient for software developers, and existing coarse-grained, README-based project-level categorization does not support fine-grained functional understanding. To address this, we propose AutoFL, the first automated, cross-granularity functional domain labeling method, which supports file-, package-, and project-level annotations without relying on non-code documentation (e.g., READMEs). AutoFL directly models source code semantics via a weakly supervised learning framework that integrates code text parsing, multi-granularity semantic embedding, and hierarchical aggregation for end-to-end label generation. Evaluated across multilingual open-source projects, AutoFL significantly improves the accuracy, consistency, and interpretability of functional labels compared to baselines. It alleviates key bottlenecks in software comprehension by enabling precise, scalable, and documentation-agnostic functional awareness.
Abstract
Software comprehension, especially of new codebases, is time-consuming for developers, particularly in large projects with multiple functionalities spanning various domains. One strategy to reduce this effort is to annotate files with meaningful labels that describe the functionalities they contain. However, prior research has focused on classifying whole projects using README files as a proxy, which gives developers little information. Our objective is to streamline the labelling of files with the correct application domains using source code as input. To this end, in prior work we evaluated the ability to annotate files automatically using a weak labelling approach. This paper presents AutoFL, a tool for automatically labelling software repositories from source code. AutoFL supports annotations at multiple granularities: file, package, and project level. We provide an overview of the tool's internals, present an example analysis for which AutoFL can be used, and discuss limitations and future work.
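
To make the idea of multi-granular annotation concrete, the following is a minimal sketch of one way file-level labels could be produced from source code and then rolled up to package and project level. It is not AutoFL's implementation: the keyword lists in `DOMAIN_KEYWORDS`, the keyword-counting heuristic standing in for weak labelling, the averaging step, and the function names `label_file`, `average`, and `label_project` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from pathlib import PurePosixPath
import re

# Hypothetical keyword lists per application domain (illustrative only).
DOMAIN_KEYWORDS = {
    "networking": {"socket", "http", "request", "port"},
    "database": {"query", "sql", "table", "cursor"},
}


def label_file(source: str) -> dict[str, float]:
    """Weakly label one file: share of keyword hits per domain."""
    tokens = Counter(re.findall(r"[a-z]+", source.lower()))
    hits = {d: sum(tokens[k] for k in kws) for d, kws in DOMAIN_KEYWORDS.items()}
    total = sum(hits.values())
    if total == 0:
        # No keyword matched: leave the file unlabelled.
        return {d: 0.0 for d in hits}
    return {d: n / total for d, n in hits.items()}


def average(dists: list[dict[str, float]]) -> dict[str, float]:
    """Mean of several label distributions (all zeros if the list is empty)."""
    if not dists:
        return {d: 0.0 for d in DOMAIN_KEYWORDS}
    return {d: sum(x[d] for x in dists) / len(dists) for d in DOMAIN_KEYWORDS}


def label_project(files: dict[str, str]):
    """Return (file, package, project) labels for a {path: source} mapping."""
    file_labels = {path: label_file(src) for path, src in files.items()}

    # Assumption: a "package" is the directory that contains the file.
    by_package = defaultdict(list)
    for path, dist in file_labels.items():
        by_package[str(PurePosixPath(path).parent)].append(dist)

    package_labels = {pkg: average(ds) for pkg, ds in by_package.items()}
    project_labels = average(list(file_labels.values()))
    return file_labels, package_labels, project_labels
```

Calling `label_project({"net/client.py": "...", "db/store.py": "..."})` returns three dictionaries, one per granularity. Plain averaging is only the simplest possible aggregation; a real tool might weight files by size or use a learned model instead.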