NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data

📅 2026-04-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing name-to-nationality classifiers: they are often trained on small or homogeneous data and consequently struggle to identify names from low-resource countries. The authors construct a large-scale name–nationality dataset from open academic graphs and use large language models as a data augmentation tool, generating synthetic name samples for underrepresented nationalities. A lightweight NameBERT classifier is then trained on this augmented dataset. The approach improves classification performance for tail nationalities while remaining computationally efficient enough for real-time inference. In the reported experiments, NameBERT outperforms current baselines on both in-domain and out-of-domain evaluations, with the largest gains on test sets that contain synthetic low-resource names.
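The augmentation step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the per-country target size, the deduplication rule, and the `generate_names` callable (a stand-in for whatever LLM prompt produces names) are all assumptions introduced here.

```python
from collections import Counter


def augmentation_quota(labels, target_per_class):
    """Return how many synthetic names each underrepresented country needs.

    `target_per_class` is a hypothetical knob; the summary does not state
    how the tail is defined or how many names are generated.
    """
    counts = Counter(labels)
    return {
        country: target_per_class - n
        for country, n in counts.items()
        if n < target_per_class
    }


def augment(dataset, generate_names, target_per_class):
    """Top up tail countries with synthetic names.

    dataset: list of (name, country) pairs.
    generate_names: hypothetical callable (country, k) -> list of names,
        standing in for an LLM call.
    """
    quota = augmentation_quota([c for _, c in dataset], target_per_class)
    seen = {(name.lower(), country) for name, country in dataset}
    augmented = list(dataset)
    for country, k in sorted(quota.items()):
        for name in generate_names(country, k):
            key = (name.lower(), country)
            if key not in seen:  # skip names already present for this country
                seen.add(key)
                augmented.append((name, country))
    return augmented


# Usage with a stub generator in place of a real LLM call.
data = [("Li Wei", "CN"), ("Wang Fang", "CN"), ("Zhang Min", "CN"),
        ("Sione Tupou", "TO")]
stub = lambda country, k: [f"Synthetic {country} {i}" for i in range(k)]
balanced = augment(data, stub, target_per_class=3)
```

Keeping the LLM behind a plain callable means it is only invoked while building the dataset, which matches the summary's framing of LLMs as enrichers rather than inference engines.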

📝 Abstract

Inferring nationality from personal names is a critical capability for equity and bias monitoring and for personalization, and a valuable tool in biomedical and sociological research. However, existing name-based nationality classifiers are typically trained on relatively small or source-specific labeled datasets, which can introduce coverage gaps and limit performance for underrepresented countries. While large language models (LLMs) demonstrate strong zero-shot performance for name-based nationality prediction, their computational cost and latency make them impractical for real-time, large-scale deployment. In this work, we create a large-scale name-nationality dataset from the Open Academic Graph (OAG) and introduce a framework that leverages LLMs as dataset enrichers rather than inference engines. We augment low-resource countries with LLM-generated names and evaluate on real and synthetic-tail test sets. We find that augmentation produces large gains when evaluation includes synthetic tail names and still offers a modest lift on tail-country metrics otherwise. Overall, NameBERT models achieve significantly higher accuracy than state-of-the-art baselines across both in- and out-of-domain tasks, while remaining efficient for large-scale inference compared to LLMs.
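The abstract refers to tail-country metrics without defining them here. One plausible reading, shown purely as an illustration, is per-country recall macro-averaged over countries that are rare in the training data. The threshold and the choice of macro recall are assumptions, not the paper's stated protocol.

```python
from collections import Counter, defaultdict


def tail_macro_recall(train_labels, y_true, y_pred, tail_threshold):
    """Macro-averaged recall over countries rare in the training set.

    A country counts as "tail" when it has fewer than `tail_threshold`
    training examples (a hypothetical cutoff chosen for this sketch).
    Returns None when the test set contains no tail-country examples.
    """
    train_counts = Counter(train_labels)
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        if train_counts[truth] < tail_threshold:
            totals[truth] += 1
            hits[truth] += int(truth == pred)
    if not totals:
        return None
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy example: "TO" and "FJ" are tail countries under a threshold of 3.
train = ["US"] * 5 + ["TO"] * 1 + ["FJ"] * 2
score = tail_macro_recall(
    train,
    y_true=["US", "TO", "TO", "FJ"],
    y_pred=["US", "TO", "US", "US"],
    tail_threshold=3,
)
```

Averaging per country rather than per example keeps a few comparatively well-covered tail countries from dominating the score, which matters when the question is coverage of underrepresented countries.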

Problem

Research questions and friction points this paper is trying to address.

name-based nationality classification

coverage gaps

underrepresented countries

large-scale inference

computational cost

Innovation

Methods, ideas, or system contributions that make the work stand out.

name-based nationality classification

LLM-augmented data

data augmentation