Semantic Source Code Segmentation using Small and Large Language Models

📅 2025-07-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Semantic source code segmentation for low-resource programming languages such as R in social science research is hindered by the scarcity of annotated data and the limited scalability of manual and syntactic parsing approaches. Method: This paper proposes an automated semantic source code segmentation framework designed for statistical computing contexts. It makes three contributions: (1) two segmentation strategies, context-aware line-by-line sequence labeling and range-based segment determination; (2) construction of StatCodeSeg, the first high-quality, domain-specific annotated dataset for statistical code segmentation; and (3) fine-tuning of lightweight models (CodeBERT and an encoder-only CodeT5+) on this dataset. Results: Experiments demonstrate that these fine-tuned models significantly outperform large language models (LLMs), despite being fine-tuned on only 4,130 annotated lines, and exhibit strong cross-lingual transfer. The context-aware line-by-line approach performs best, establishing an efficient, deployable paradigm for knowledge retrieval and maintenance in low-resource programming languages.
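The context-aware line-by-line strategy can be illustrated with a minimal sketch: each line of an R script is paired with a window of surrounding lines, producing one classifier input per line that a fine-tuned encoder (e.g. CodeBERT) would label with a segment type. The window size, separator token, and helper name below are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of context-aware line-by-line segmentation inputs.
# A fine-tuned encoder would consume each string and predict a
# per-line segment label; here we only show the input construction.

def build_context_inputs(lines, window=2, sep=" </s> "):
    """Return one classifier input per line: the target line joined
    with up to `window` lines of context on each side."""
    inputs = []
    for i, line in enumerate(lines):
        before = lines[max(0, i - window):i]
        after = lines[i + 1:i + 1 + window]
        context = " ".join(before + after)
        inputs.append(line + sep + context)
    return inputs

r_script = [
    'library(dplyr)',
    'data <- read.csv("survey.csv")',
    'data <- filter(data, age >= 18)',
    'model <- lm(score ~ age, data = data)',
    'summary(model)',
]

inputs = build_context_inputs(r_script, window=1)
# Each element now carries its neighbouring lines, ready for
# tokenization and per-line classification by a fine-tuned model.
```

Keeping the target line first and appending context after a separator lets a single-sequence classifier reuse standard fine-tuning pipelines while still seeing surrounding code.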

📝 Abstract
Source code segmentation, dividing code into functionally coherent segments, is crucial for knowledge retrieval and maintenance in software development. While enabling efficient navigation and comprehension of large codebases, manual and syntactic analysis approaches have become impractical as repositories grow, especially for low-resource languages like R and their research domains (e.g., social sciences, psychology). This paper introduces an automated, domain-specific approach for research R code segmentation using Large and Small Language Models (LLMs/SLMs). It presents two novel approaches and a human-annotated dataset, StatCodeSeg. We explore two distinct approaches: line-by-line analysis with context and range-based segment determination. We experiment with LLMs and fine-tuned SLMs. To support the generalizability of our approaches, we also include experiments on Python code from the computer science domain. Our results show that context-based line-by-line analysis is superior to range-based segmentation. Smaller language models like CodeBERT and an encoder-only version of CodeT5+ perform better than their LLM counterparts. Most notably, unlike the LLMs, these two best-performing models did not see R code during pre-training and were only fine-tuned on 4,130 lines of manually annotated code.
Problem

Research questions and friction points this paper is trying to address.

Automating semantic segmentation of research R code
Improving code navigation for low-resource languages like R
Comparing line-by-line and range-based segmentation approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated R code segmentation using LLMs/SLMs
Line-by-line analysis outperforms range-based segmentation
Fine-tuned SLMs excel without R pre-training
Abdelhalim Dahou
GESIS - Institute for Social Sciences, Cologne, Germany
Ansgar Scherp
University of Ulm, Germany
Text Analytics, Data Science, Data Mining/Machine Learning, Semantic Web/Linked Open Data
Sebastian Kurten
Utrecht University, Utrecht, Netherlands
Brigitte Mathiak
GESIS - Leibniz Institute for the Social Sciences
Information Retrieval
Madhu Chauhan
IAB - Institut für Arbeitsmarkt- und Berufsforschung, Nürnberg, Germany