Design and Evaluation of a Scalable Data Pipeline for AI-Driven Air Quality Monitoring in Low-Resource Settings

📅 2025-08-20
🤖 AI Summary
To address data scarcity, weak infrastructure, and the difficulty of processing heterogeneous multi-source data for air quality monitoring in resource-constrained regions, this paper designs and implements the AirQo data pipeline, a cloud-native system. It proposes a modular ETL architecture with a decoupled ingestion layer, AI-driven automatic sensor calibration using lightweight ML models, and fault-tolerance mechanisms for network and power outages, all integrated within a fully observable end-to-end framework. Apache Kafka decouples real-time streaming, Apache Airflow orchestrates batch and stream workflows, and BigQuery supports analytical workloads. Deployed across 400+ low-cost sensors in Africa, the pipeline processes over ten million records monthly with demonstrated stability. Evaluation shows a 32% reduction in calibration error and a 45% decrease in computational resource overhead, supporting its scalability, robustness, and cross-regional reusability.
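The summary does not say how outages are tolerated. One common approach is store-and-forward buffering, sketched below under that assumption: readings are queued locally and delivered in order once the uplink returns. `StoreAndForwardBuffer`, `FlakyUplink`, and the reading fields are hypothetical names for illustration, not part of the AirQo codebase.

```python
from collections import deque


class StoreAndForwardBuffer:
    """Queue readings locally and deliver them in order when the uplink works.

    Assumption: `send` is any callable that raises ConnectionError on failure.
    """

    def __init__(self, send, max_size=1000):
        self._send = send
        # Bounded queue: when full, the oldest reading is dropped first.
        self._pending = deque(maxlen=max_size)

    def submit(self, reading):
        """Queue a reading, then try to deliver everything that is pending."""
        self._pending.append(reading)
        return self.flush()

    def flush(self):
        """Send pending readings oldest-first; stop at the first failure."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # uplink is down, keep the reading for a later attempt
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._pending)


class FlakyUplink:
    """Hypothetical stand-in for a network client that can be offline."""

    def __init__(self):
        self.online = False
        self.delivered = []

    def __call__(self, reading):
        if not self.online:
            raise ConnectionError("uplink down")
        self.delivered.append(reading)


uplink = FlakyUplink()
buffer = StoreAndForwardBuffer(uplink, max_size=100)
buffer.submit({"device": "aq-01", "pm2_5": 35.2})  # made-up reading, stays queued
uplink.online = True
buffer.submit({"device": "aq-01", "pm2_5": 36.0})  # both readings are delivered
```

A bounded queue trades completeness for predictable memory use on constrained hardware. A deployment that cannot lose data would more likely persist the queue to disk.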

📝 Abstract
The increasing adoption of low-cost environmental sensors and AI-enabled applications has accelerated the demand for scalable and resilient data infrastructures, particularly in data-scarce and resource-constrained regions. This paper presents the design, implementation, and evaluation of the AirQo data pipeline: a modular, cloud-native Extract-Transform-Load (ETL) system engineered to support both real-time and batch processing of heterogeneous air quality data across urban deployments in Africa. It is built on the open-source tools Apache Airflow and Apache Kafka, together with Google BigQuery. The pipeline integrates diverse data streams from low-cost sensors, third-party weather APIs, and reference-grade monitors to enable automated calibration, forecasting, and accessible analytics. We demonstrate the pipeline's ability to ingest, transform, and distribute millions of air quality measurements monthly from over 400 monitoring devices while achieving low latency, high throughput, and robust data availability, even under constrained power and connectivity conditions. The paper details key architectural features, including workflow orchestration, decoupled ingestion layers, machine learning-driven sensor calibration, and observability frameworks. Performance is evaluated across operational metrics such as resource utilization, ingestion throughput, calibration accuracy, and data availability, offering practical insights into building sustainable environmental data platforms. By open-sourcing the platform and documenting deployment experiences, this work contributes a reusable blueprint for similar initiatives seeking to advance environmental intelligence through data engineering in low-resource settings.
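The abstract does not specify the calibration model. As a minimal sketch of the general idea, the example below fits an ordinary least-squares line that maps raw low-cost sensor readings onto co-located reference-grade readings. The function names and the paired PM2.5 values are made up for illustration and do not come from the paper.

```python
from statistics import mean


def fit_linear_calibration(raw, reference):
    """Fit reference ~ slope * raw + intercept by ordinary least squares."""
    if len(raw) != len(reference) or len(raw) < 2:
        raise ValueError("need at least two paired readings")
    raw_mean, ref_mean = mean(raw), mean(reference)
    sxx = sum((x - raw_mean) ** 2 for x in raw)
    if sxx == 0:
        raise ValueError("raw readings must not all be identical")
    sxy = sum((x - raw_mean) * (y - ref_mean) for x, y in zip(raw, reference))
    slope = sxy / sxx
    intercept = ref_mean - slope * raw_mean
    return slope, intercept


def calibrate(raw_value, slope, intercept):
    """Apply the fitted correction to a single raw reading."""
    return slope * raw_value + intercept


# Hypothetical co-located PM2.5 readings in µg/m³ (illustrative values only).
raw_pm25 = [10.0, 20.0, 30.0, 40.0]
reference_pm25 = [8.0, 14.0, 20.0, 26.0]

slope, intercept = fit_linear_calibration(raw_pm25, reference_pm25)
corrected = calibrate(50.0, slope, intercept)
```

A production calibration would likely add covariates such as temperature and humidity from the weather feeds the abstract mentions, and might use a different model family. The linear fit here only illustrates the raw-to-reference mapping.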
Problem

Research questions and friction points this paper is trying to address.

Designing a scalable data pipeline for AI-driven air quality monitoring
Handling heterogeneous data streams in low-resource African settings
Achieving low latency and high throughput under constrained power and connectivity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular, cloud-native ETL system
Integration of diverse air quality data streams (low-cost sensors, weather APIs, reference-grade monitors)
Open-source infrastructure built on Apache Kafka and Apache Airflow
Richard Sserujongi
Makerere University, Plot 56 University Pool Road, Kampala, Uganda
Daniel Ogenrwot
Makerere University, Plot 56 University Pool Road, Kampala, Uganda; University of Nevada Las Vegas, Las Vegas, NV 89154, USA
Nicholas Niwamanya
Makerere University, Plot 56 University Pool Road, Kampala, Uganda
Noah Nsimbe
Makerere University, Plot 56 University Pool Road, Kampala, Uganda
Martin Bbaale
Makerere University, Plot 56 University Pool Road, Kampala, Uganda
Benjamin Ssempala
Makerere University, Plot 56 University Pool Road, Kampala, Uganda
Noble Mutabazi
Makerere University, Plot 56 University Pool Road, Kampala, Uganda
Raja Fidel Wabinyai
Makerere University, Plot 56 University Pool Road, Kampala, Uganda
Deo Okure
Makerere University, Plot 56 University Pool Road, Kampala, Uganda
Engineer Bainomugisha
Professor of Computer Science, Makerere University, Kampala
Programming Languages, Distributed systems, Reactive Programming, Cloud, AI and machine learning