Towards Enhancing Data Equity in Public Health Data Science

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Underrepresentation in public health data risks systemic bias that undermines the validity of downstream inference and policy decisions. To address this, the authors propose a working definition of “public health data equity” that integrates computational principles (fairness, accountability, transparency, ethics, privacy, confidentiality) with core public health considerations (selection bias, representativeness, generalizability, causality, information bias). The result is a structured self-audit framework, grounded in reflexivity, that spans the data life cycle from study design to translation and is intended for use in routine data science practice. Three public health cases illustrate how non-representative datasets impede decisions across diverse subgroups. The authors also stress that data equity is an essential first step but does not by itself guarantee information, learning, or decision equity.

📝 Abstract
Data-driven decisions shape public health policies and practice, yet persistent disparities in data representation skew insights and undermine interventions. To address this, we advance a structured roadmap that integrates public health data science with computer science and is grounded in reflexivity. We adopt data equity as a guiding concept: ensuring the fair and inclusive representation, collection, and use of data to prevent the introduction or exacerbation of systemic biases that could lead to invalid downstream inference and decisions. To underscore urgency, we present three public health cases where non-representative datasets and skewed knowledge impede decisions across diverse subgroups. These challenges echo themes in two literatures: public health highlights gaps in high-quality data for specific populations, while computer science and statistics contribute criteria and metrics for diagnosing bias in data and models. Building on these foundations, we propose a working definition of public health data equity and a structured self-audit framework. Our framework integrates core computational principles (fairness, accountability, transparency, ethics, privacy, confidentiality) with key public health considerations (selection bias, representativeness, generalizability, causality, information bias) to guide equitable practice across the data life cycle, from study design and data collection to measurement, analysis, interpretation, and translation. Embedding data equity in routine practice offers a practical path for ensuring that data-driven policies, artificial intelligence, and emerging technologies improve health outcomes for all. Finally, we emphasize the critical understanding that, although data equity is an essential first step, it does not inherently guarantee information, learning, or decision equity.
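As a rough illustration of the representativeness consideration named in the abstract, the sketch below compares each subgroup's share of a sample with its share of a reference population. It is not the authors' self-audit framework. The function names, the 0.8 flagging threshold, and the subgroup labels and counts are all hypothetical, chosen only to show the arithmetic.

```python
from collections import Counter


def representation_ratios(sample_groups, population_shares):
    """Return (sample share / reference share) for each subgroup.

    sample_groups: list of subgroup labels, one per sampled record.
    population_shares: dict mapping subgroup label to its share of the
    reference population (hypothetical values; each must be positive).
    """
    n = len(sample_groups)
    if n == 0:
        raise ValueError("sample_groups is empty")
    if any(share <= 0 for share in population_shares.values()):
        raise ValueError("reference shares must be positive")
    counts = Counter(sample_groups)
    return {
        group: (counts.get(group, 0) / n) / share
        for group, share in population_shares.items()
    }


def underrepresented(ratios, threshold=0.8):
    """List subgroups whose ratio falls below an assumed threshold."""
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)


# Hypothetical sample of 100 records and hypothetical reference shares.
sample = ["A"] * 70 + ["B"] * 25 + ["C"] * 5
reference = {"A": 0.60, "B": 0.25, "C": 0.15}

ratios = representation_ratios(sample, reference)
flagged = underrepresented(ratios)
print(ratios)
print(flagged)
```

A ratio near 1 means a subgroup appears in the sample roughly in proportion to the reference population. In this made-up example, group C makes up 5% of the sample against 15% of the reference population, so it is flagged. A check like this addresses only representation in the data and says nothing about measurement quality or downstream decisions.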
Problem

Research questions and friction points this paper is trying to address.

Addressing data representation disparities in public health decisions
Integrating data equity concepts to prevent systemic biases
Proposing a framework for equitable data practices across the data life cycle
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates public health and computer science principles
Proposes self-audit framework for data lifecycle equity
Combines computational fairness with public health representativeness
Yiran Wang
Department of Biostatistics, Yale School of Public Health, Yale University, 60 College Street, New Haven, CT 06510, USA
Alicia E. Boyd
Department of Biostatistics, Yale School of Public Health, Yale University, 60 College Street, New Haven, CT 06510, USA
Lillian Rountree
Department of Biostatistics, Yale School of Public Health, Yale University, 60 College Street, New Haven, CT 06510, USA
Yi Ren
Department of Biostatistics, Yale School of Public Health, Yale University, 60 College Street, New Haven, CT 06510, USA
Kate Nyhan
Harvey Cushing/John Hay Whitney Medical Library, Yale University, 333 Cedar St, New Haven, CT 06510, USA; Department of Environmental Health Sciences, Yale School of Public Health, Yale University, 60 College Street, New Haven, CT 06510, USA
Ruchit Nagar
Department of Pediatrics, Yale–New Haven Hospital, New Haven, CT 06510, USA; Department of Internal Medicine, Yale–New Haven Hospital, New Haven, CT 06510, USA
Jackson Higginbottom
Department of Biostatistics, Yale School of Public Health, Yale University, 60 College Street, New Haven, CT 06510, USA
Megan L. Ranney
Department of Health Policy and Management, Yale School of Public Health, Yale University, 60 College Street, New Haven, CT 06510, USA
Harsh Parikh
Yale University
Causal Inference, Causality, Econometrics, Machine Learning, Statistics
Bhramar Mukherjee
Professor of Biostatistics, Yale School of Public Health
Biostatistics, Cancer, Genetics, Statistics, Environmental Health