Metadata Conditioned Large Language Models for Localization

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models often exhibit geographic homogenization due to the scarcity of geographically diverse data in their training corpora, limiting their effectiveness for localized tasks. To address this, the authors propose a metadata-conditioned pretraining approach that uses geographic metadata (URLs, country tags, and continent tags) as conditioning signals, training 31 models from scratch at the 0.5B and 1B parameter scales. The method substantially improves in-region performance without compromising cross-region generalization, and matches the efficacy of specialized local models while using far less data. Evaluated on a newly constructed benchmark of 800 multiple-choice questions derived from localized news articles, the models attain accuracy on par with LLaMA-3.2-1B-Instruct, demonstrating the effectiveness of URL-level metadata and the critical importance of balanced regional data representation.

📝 Abstract
Large language models are typically trained by treating text as a single global distribution, often resulting in geographically homogenized behavior. We study metadata conditioning as a lightweight approach for localization, pre-training 31 models (at 0.5B and 1B parameter scales) from scratch on large-scale English news data annotated with verified URLs, country tags, and continent tags, covering 4 continents and 17 countries. Across four controlled experiments, we show that metadata conditioning consistently improves in-region performance without sacrificing cross-region generalization, enables global models to recover localization comparable to region-specific models, and improves learning efficiency. Our ablation studies demonstrate that URL-level metadata alone captures much of the geographic signal, while balanced regional data coverage remains essential, as metadata cannot fully compensate for missing regions. Finally, we introduce a downstream benchmark of 800 localized news MCQs and show that after instruction tuning, metadata conditioned global models achieve accuracy comparable to LLaMA-3.2-1B-Instruct, despite being trained on substantially less data. Together, these results establish metadata conditioning as a practical and compute-efficient approach for localization of language models.
Problem

Research questions and friction points this paper is trying to address.

localization
geographic bias
language model homogenization
metadata conditioning
cross-region generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

metadata conditioning
localization
large language models
geographic adaptation
efficient pre-training
Anjishnu Mukherjee
Department of Computer Science, George Mason University
Ziwei Zhu
Assistant Professor at George Mason University
data mining, information retrieval, machine learning, responsible AI
Antonios Anastasopoulos
Department of Computer Science, George Mason University