StRuCom: A Novel Dataset of Structured Code Comments in Russian

📅 2025-05-16

📈 Citations: 0

✨ Influential: 0

career value

147K/year

🤖 AI Summary

Existing machine learning models exhibit significant performance degradation in generating structured code docstrings for Russian compared to English, hindering the maintainability of the Russian-language code ecosystem. To address this, we introduce the first large-scale, multilingual (Python, Java, JavaScript, C#, Go), hybrid (real-world + synthetically augmented) Russian structured docstring dataset—comprising 153K high-quality samples. We propose a novel human-in-the-loop data construction paradigm integrated with automated syntactic and semantic validation against official documentation standards, effectively mitigating terminology distortion and docstring format violations. Leveraging this dataset, we fine-tune the Qwen2.5-Coder family (0.5B–7B parameters) and achieve substantial improvements over strong baselines on both chrF++ (+4.2) and BERTScore (+3.8), demonstrating the dataset’s efficacy, cross-language generalizability, and the methodological innovation of our curation framework.

Technology Category

Application Category

📝 Abstract

Structured code comments in docstring format are essential for code comprehension and maintenance, but existing machine learning models for their generation perform poorly for Russian compared to English. To bridge this gap, we present StRuCom - the first large-scale dataset (153K examples) specifically designed for Russian code documentation. Unlike machine-translated English datasets that distort terminology (e.g., technical loanwords vs. literal translations) and docstring structures, StRuCom combines human-written comments from Russian GitHub repositories with synthetically generated ones, ensuring compliance with Python, Java, JavaScript, C#, and Go standards through automated validation. Fine-tuning Qwen2.5-Coder models (0.5B-7B) on StRuCom shows statistically significant improvements of chrf++ and BERTScore over baseline models.

Problem

Research questions and friction points this paper is trying to address.

Addresses poor performance of ML models in generating Russian structured code comments

Introduces StRuCom, first large-scale dataset for Russian code documentation

Improves model performance via fine-tuning on human-written and synthetic comments

Innovation

Methods, ideas, or system contributions that make the work stand out.

First large-scale Russian code comment dataset

Combines human-written and synthetic comments

Fine-tuned Qwen2.5-Coder models for improvement

🔎 Similar Papers

No similar papers found.