A systematic machine learning approach to measure and assess biases in mobile phone population data

📅 2025-08-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Traditional demographic data sources such as censuses are costly and infrequent. Mobile phone application data offer high spatiotemporal resolution but exhibit systematic coverage bias driven by digital inequality, and no standard framework exists for evaluating that bias. Method: The authors develop a reproducible, attribute-agnostic framework for quantifying coverage bias. It benchmarks aggregated mobile data against census counts to derive a transparent coverage indicator, then applies explainable machine learning to uncover nonlinear geographic, socioeconomic, and demographic drivers of bias. Contribution/Results: Applied to four UK datasets benchmarked against the 2021 census, the approach shows that mobile data achieve higher population coverage than major national surveys, yet substantial biases persist across data sources and subnational areas. It also challenges the assumption that multi-application datasets are inherently more representative than single-app sources. The framework offers a methodological foundation and practical tools for assessing the reliability of digital trace data in demographic inference.

📝 Abstract

Traditional sources of population data, such as censuses and surveys, are costly, infrequent, and often unavailable in crisis-affected regions. Mobile phone application data offer near real-time, high-resolution insights into population distribution, but their utility is undermined by unequal access to and use of digital technologies, creating biases that threaten representativeness. Despite growing recognition of these issues, there is still no standard framework to measure and explain such biases, limiting the reliability of digital traces for research and policy. We develop and implement a systematic, replicable framework to quantify coverage bias in aggregated mobile phone application data without requiring individual-level demographic attributes. The approach combines a transparent indicator of population coverage with explainable machine learning to identify contextual drivers of spatial bias. Using four datasets for the United Kingdom benchmarked against the 2021 census, we show that mobile phone data consistently achieve higher population coverage than major national surveys, but substantial biases persist across data sources and subnational areas. Coverage bias is strongly associated with demographic, socioeconomic, and geographic features, often in complex nonlinear ways. Contrary to common assumptions, multi-application datasets do not necessarily reduce bias compared to single-app sources. Our findings establish a foundation for bias assessment standards in mobile phone data, offering practical tools for researchers, statistical agencies, and policymakers to harness these datasets responsibly and equitably.
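The abstract does not spell out the coverage indicator, so the following is only a minimal sketch of one plausible form: the ratio of users observed in an aggregated mobile dataset to the census population of the same area, plus each area's deviation from the overall ratio. The `Area` class, the field names, the `coverage_report` helper, and the example counts are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Area:
    """One subnational area (hypothetical structure, for illustration only)."""
    name: str
    app_users: int          # users observed in the aggregated mobile dataset
    census_population: int  # benchmark count for the same area


def coverage(area):
    """Share of the census population captured by the mobile dataset."""
    if area.census_population <= 0:
        raise ValueError(f"{area.name}: census population must be positive")
    return area.app_users / area.census_population


def coverage_report(areas):
    """Return per-area coverage and its relative deviation from overall coverage.

    A relative deviation of 0 means the area is covered exactly as well as the
    country as a whole; negative values indicate under-coverage.
    """
    total_users = sum(a.app_users for a in areas)
    total_population = sum(a.census_population for a in areas)
    overall = total_users / total_population
    report = {}
    for a in areas:
        c = coverage(a)
        report[a.name] = {
            "coverage": c,
            "relative_to_overall": c / overall - 1.0,
        }
    return report, overall


# Made-up counts, not data from the paper.
areas = [
    Area("A", app_users=500, census_population=10_000),
    Area("B", app_users=150, census_population=5_000),
    Area("C", app_users=350, census_population=5_000),
]
report, overall = coverage_report(areas)
for name, row in report.items():
    print(name, round(row["coverage"], 3), round(row["relative_to_overall"], 2))
```

Because the sketch needs only area-level counts, it is consistent with the abstract's point that no individual-level demographic attributes are required.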

Problem

Research questions and friction points this paper is trying to address.

Measuring coverage bias in mobile phone population data without demographic attributes

Identifying contextual drivers of spatial bias using explainable machine learning

Establishing bias assessment standards for mobile phone data reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic framework to quantify mobile data coverage bias

Explainable machine learning identifies spatial bias drivers

Transparent indicator without individual demographic attributes
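This summary does not say which explainable machine learning technique the paper uses. As a generic stand-in, the sketch below shows permutation importance, a model-agnostic way to ask how much a fitted model's error grows when one contextual feature is shuffled across areas. The `toy_model` predictor, the two features, and the synthetic data are assumptions made purely for illustration.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Average increase in MSE when each feature column is shuffled.

    `predict` is any function mapping one feature row to a predicted coverage.
    Larger values mean the model relies more heavily on that feature.
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical fitted model: coverage depends nonlinearly on the first
    feature (say, median age) and ignores the second one entirely."""
    median_age, _unused = row
    return 0.10 - 0.00005 * (median_age - 30) ** 2


# Synthetic areas: [median_age, irrelevant_feature]; not data from the paper.
data_rng = random.Random(1)
X = [[data_rng.uniform(20, 60), data_rng.uniform(0, 1)] for _ in range(200)]
y = [toy_model(row) for row in X]

scores = permutation_importance(toy_model, X, y)
print(scores)  # the first score is positive, the second is zero
```

Permutation importance ranks drivers but does not show the shape of a nonlinear relationship; a partial dependence or SHAP-style analysis would be needed for that, and the paper may well use a different method.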

🔎 Similar Papers

Sample Selection Bias in Machine Learning for Healthcare