AI Summary
This work addresses a limitation of existing name-to-nationality classifiers: they are typically trained on small or source-specific data and therefore struggle to identify names from low-resource countries. The authors construct a large-scale name-nationality dataset from the Open Academic Graph (OAG) and use large language models as a data augmentation tool, generating synthetic names for underrepresented nationalities. A lightweight NameBERT classifier is then trained on the augmented dataset. Augmentation yields large gains when the test set includes synthetic low-resource names and a modest lift on tail-country metrics otherwise, while inference remains efficient compared with LLMs. NameBERT outperforms current baselines on both in-domain and out-of-domain evaluations, which makes it suitable for large-scale deployment.
Abstract
Inferring nationality from personal names is a critical capability for equity and bias monitoring and for personalization, and a valuable tool in biomedical and sociological research. However, existing name-based nationality classifiers are typically trained on relatively small or source-specific labeled datasets, which can introduce coverage gaps and limit performance for underrepresented countries. While large language models (LLMs) demonstrate strong zero-shot performance for name-based nationality prediction, their computational cost and latency make them impractical for real-time, large-scale deployment. In this work, we create a large-scale name-nationality dataset from the Open Academic Graph (OAG) and introduce a framework that leverages LLMs as dataset enrichers rather than inference engines. We augment low-resource countries with LLM-generated names and evaluate on both real and synthetic-tail test sets. We find that augmentation produces large gains when evaluation includes synthetic tail names and still offers a modest lift on tail-country metrics otherwise. Overall, NameBERT models achieve significantly higher accuracy than state-of-the-art baselines across both in-domain and out-of-domain tasks, while remaining efficient for large-scale inference compared to LLMs.
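The augmentation idea described above (using an LLM to enrich the training set for low-resource countries instead of calling it at inference time) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `plan_augmentation` and `augment`, the `min_per_country` threshold, and the `placeholder_generator` stand-in for an LLM call are all assumptions made for illustration, and the sample rows are toy data.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def plan_augmentation(labels: List[str], min_per_country: int) -> Dict[str, int]:
    """Map each tail country to the number of synthetic names it still needs."""
    counts = Counter(labels)
    return {c: min_per_country - n for c, n in counts.items() if n < min_per_country}


def augment(
    samples: List[Tuple[str, str]],
    min_per_country: int,
    generate_names: Callable[[str, int], List[str]],
) -> List[Tuple[str, str, bool]]:
    """Return (name, country, is_synthetic) rows, topping up tail countries."""
    plan = plan_augmentation([country for _, country in samples], min_per_country)
    rows = [(name, country, False) for name, country in samples]
    for country, needed in sorted(plan.items()):
        # Truncate in case the generator returns more names than requested.
        for name in generate_names(country, needed)[:needed]:
            rows.append((name, country, True))
    return rows


def placeholder_generator(country: str, k: int) -> List[str]:
    """Hypothetical stand-in for an LLM prompt that returns k names for a country."""
    return [f"{country}_synthetic_{i}" for i in range(k)]


# Toy data: one well-covered country and one tail country.
samples = [
    ("Anna Schmidt", "DE"),
    ("Jan Mueller", "DE"),
    ("Lukas Weber", "DE"),
    ("Sione Tupou", "TO"),
]
augmented = augment(samples, min_per_country=3, generate_names=placeholder_generator)
```

Keeping an `is_synthetic` flag on every row is one way to support the separate real and synthetic-tail test sets the abstract mentions, since synthetic names can then be included in or excluded from an evaluation split explicitly.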