🤖 AI Summary
This study addresses the uneven quality of language metadata in the OpenAlex database by systematically evaluating the multilingual identification performance of algorithms, including LangID and FastSpell, across combinations of title, abstract, and journal-name fields. We propose a multidimensional evaluation framework tailored to large-scale academic databases, jointly weighing precision, recall, and inference time, and develop a database-level performance estimation model based on probabilistic simulation. Results show that FastSpell applied to titles alone is optimal for recall and efficiency, whereas LangID achieves peak precision when all three fields are combined. Beyond validating OpenAlex's suitability for cross-lingual bibliometric analysis, the study introduces Monte Carlo simulation and weighted harmonic scoring into language identification evaluation, yielding a reusable methodological paradigm for scholarly metadata governance.
📝 Abstract
Following a recent study on the quality of OpenAlex linguistic metadata (Céspedes et al., 2025), the present paper aims to optimize that metadata through the design, application, and evaluation of several linguistic classification procedures built on current, high-performing automatic language detection algorithms. Starting from a multilingual set of manually annotated samples of articles indexed in the database, classification procedures are designed by applying a set of language detection algorithms to corpora generated from different combinations of the textual metadata of indexed articles. First, at the sample level, the performance of these procedures is evaluated for each of the main languages in the database in terms of precision, recall, and processing time. Then, overall procedure performance is estimated at the database level by means of a probabilistic simulation of harmonically aggregated and weighted scores. Results show that procedure performance depends strongly on the importance given to each of the measures implemented: where precision is preferred, applying the LangID algorithm to article titles, abstracts, and journal names gives the best results; however, whenever recall is weighted even slightly more than precision, or processing time is given any consideration at all, applying the FastSpell algorithm to article titles alone outperforms all other alternatives. Given the lack of truly multilingual, large-scale bibliographic databases, it is hoped that these results help confirm and foster the unparalleled potential of the OpenAlex database for cross-linguistic, bibliometric-based research and analysis.
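The abstract's aggregation scheme, combining precision, recall, and a time-based score through a weighted harmonic mean, then estimating a database-level score by probabilistic simulation over the database's language distribution, can be sketched as follows. This is a minimal illustration of the general technique, not the paper's implementation: the function names, the three-component score vector, and the example weights and language shares are all assumptions introduced here for clarity.

```python
import random


def weighted_harmonic_score(scores, weights):
    # Weighted harmonic mean of component scores (e.g. precision,
    # recall, normalized speed), each assumed to lie in (0, 1].
    # The harmonic mean penalizes any single weak component.
    total_w = sum(weights)
    return total_w / sum(w / s for w, s in zip(weights, scores))


def simulate_database_score(per_language_scores, language_shares,
                            weights, n_draws=10_000, seed=0):
    # Monte Carlo estimate of the database-level score: repeatedly
    # draw a language according to its share of the database, score
    # it with the weighted harmonic mean, and average over draws.
    # `per_language_scores` and `language_shares` are hypothetical
    # inputs standing in for sample-level evaluation results.
    rng = random.Random(seed)
    languages = list(language_shares)
    shares = [language_shares[lang] for lang in languages]
    total = 0.0
    for _ in range(n_draws):
        lang = rng.choices(languages, weights=shares)[0]
        total += weighted_harmonic_score(per_language_scores[lang], weights)
    return total / n_draws


# Illustrative (invented) per-language (precision, recall, speed) scores.
scores = {"en": (0.98, 0.97, 0.90), "fr": (0.92, 0.85, 0.90)}
shares = {"en": 0.8, "fr": 0.2}
# Weighting precision twice as heavily as recall and speed.
estimate = simulate_database_score(scores, shares, weights=(2, 1, 1))
```

Varying the `weights` vector reproduces the trade-off the abstract describes: shifting weight toward recall or speed changes which procedure's score dominates, which is why the ranking of LangID versus FastSpell depends on the importance assigned to each measure.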