"Write in English, Nobody Understands Your Language Here": A Study of Non-English Trends in Open-Source Repositories

📅 2026-02-22
🤖 AI Summary
This study addresses the persistent dominance of English in open-source communities and its implications for linguistic inclusivity and collaboration equity. Analyzing a large-scale GitHub dataset of 9.14 billion interactions and 62,500 repositories from 2015 to 2025, the research quantifies usage trends for 30 natural languages across code comments, string literals, documentation, and discussions. Using multilingual text mining, statistical modeling, and cross-lingual classification, it finds significant growth in non-English content, particularly in Korean, Chinese, and Russian. Non-English projects nevertheless consistently show lower engagement and visibility than their English-based counterparts. These results underscore language's dual role as both a resource for inclusion and a barrier to equitable participation in global software development.

📝 Abstract
The open-source software (OSS) community has historically been dominated by English as the primary language for code, documentation, and developer interactions. However, with growing global participation and better support for non-Latin scripts through standards like Unicode, OSS is gradually becoming more multilingual. This study investigates how far that shift has gone, analyzing 9.14 billion GitHub issues, pull requests, and discussions, and 62,500 repositories across five programming languages and 30 natural languages, covering the period from 2015 to 2025. We examine six research questions to track changes in language use across communication, code, and documentation. We find that multilingual participation has steadily increased, especially in Korean, Chinese, and Russian. This growth appears not only in issues and discussions but also in code comments, string literals, and documentation files. While this shift reflects greater inclusivity and language diversity in OSS, it also creates language tension: the ability to express oneself in a native language can clash with shared norms around English use, especially in collaborative settings. Non-English or multilingual projects tend to receive less visibility and participation, suggesting that language remains both a resource and a barrier, shaping who gets heard, who contributes, and how open collaboration unfolds.
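
The abstract does not describe how non-English text is identified, so the following is only an illustrative sketch of one simple first step: labeling a code comment or string literal by its dominant Unicode script. The function name `dominant_script` and the coarse label set are assumptions made for this example, not the authors' method; a study of this kind would presumably need a real language classifier on top of any such heuristic.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return a coarse script label for the alphabetic characters in text.

    Hypothetical helper for illustration only. It inspects the Unicode
    character name of each letter and counts which script prefix is
    most frequent.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace, comment markers
        name = unicodedata.name(ch, "")
        if name.startswith("HANGUL"):
            counts["Hangul"] += 1
        elif name.startswith("CJK"):
            counts["Han"] += 1
        elif name.startswith("CYRILLIC"):
            counts["Cyrillic"] += 1
        elif name.startswith("LATIN"):
            counts["Latin"] += 1
        else:
            counts["Other"] += 1
    if not counts:
        return "None"  # no alphabetic characters at all
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    samples = [
        "# validate input",
        "# 사용자 입력을 검증한다",
        "# 检查输入",
        "// проверка ввода",
    ]
    for comment in samples:
        print(dominant_script(comment), "|", comment)
```

Script is only a proxy for language: Han characters appear in both Chinese and Japanese text, Cyrillic covers Russian as well as several other languages, and Latin script alone cannot separate English from, say, Portuguese. The sketch therefore illustrates why non-Latin content is comparatively easy to spot, not how the 30 languages in the study were distinguished.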
Problem

Research questions and friction points this paper is trying to address.

open-source software
multilingualism
language diversity
collaboration barriers
non-English content
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual OSS
language diversity
GitHub language analysis
non-English software development
language tension