Analyzing the Effects of Data Variability and Quantity on Predicting Particulate Matter (PM2.5) Concentrations: Insights from a Machine Learning Approach
DOI:
https://doi.org/10.63697/jeshs.2025.10042
Keywords:
PM2.5 concentration, MODIS AOD, Air quality management, Artificial intelligence, Predictive modelling
Abstract
Accurately predicting concentrations of particulate matter 2.5 microns or less in diameter (PM2.5) is imperative for public health and environmental policy. Machine learning models that use spatial and temporal datasets to predict PM2.5 concentrations are often limited by data availability and low-resolution satellite imagery. In this study, we present multiple predictive models designed for generalized PM2.5 predictions, whose output has been applied to different spatial locations. Using Random Forest (RF) and Extreme Gradient Boosting (XGB) algorithms, these models follow a multidisciplinary approach that combines Moderate Resolution Imaging Spectroradiometer Aerosol Optical Depth (MODIS AOD) with surface datasets (relative humidity, barometric pressure, outdoor temperature, wind speed, and wind direction). The models are trained and validated on historical data to evaluate the impact of training data variability and quantity on the predictive performance of RF and XGB models for PM2.5 concentrations. Using MODIS AOD alone yielded weak predictive performance, with average R² values ranging from −0.06 to 0.07 across the three urban areas (Washington, D.C., Boston, and New York City), highlighting its limited capability as a sole predictor. Integrating the meteorological data with MODIS AOD significantly improved model performance: RF models achieved R² values of 0.30–0.62 and XGB models achieved R² values of 0.25–0.63, with corresponding RMSE values reduced by 20–30% relative to the AOD-only models. Feature importance analysis revealed that PM2.5 predictions were most strongly influenced by temperature (average importance of 0.21), wind speed (0.20), and wind direction (0.15). MODIS AOD exhibited moderate importance (≈0.12), indicating that although satellite-based aerosol observations contributed to the predictions, ground-based meteorological variables remained the primary drivers. These quantitative results show that combining satellite observations with meteorological measurements substantially enhances PM2.5 predictive accuracy, which can inform urban planning, environmental policy, and public health interventions to better protect vulnerable populations.
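As a reading aid for the metrics quoted in the abstract, the minimal sketch below assumes the standard definitions of RMSE and R² (1 − SS_res / SS_tot); the abstract does not state which implementation the authors used. It shows why R² can fall below zero, as in the AOD-only case: this happens whenever a model's squared error exceeds that of simply predicting the observed mean. The function names and all numbers are hypothetical and are not data from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error, in the same units as the observations."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - improved) / baseline


# Made-up PM2.5 values (µg/m³), for illustration only.
observed = [8.0, 12.0, 10.0, 14.0, 6.0]
good_pred = [9.0, 11.0, 10.0, 13.0, 7.0]   # tracks the observations closely
poor_pred = [12.0, 8.0, 13.0, 7.0, 11.0]   # worse than predicting the mean

r2_good = r_squared(observed, good_pred)
r2_poor = r_squared(observed, poor_pred)   # negative: SS_res exceeds SS_tot
rmse_cut = relative_reduction(rmse(observed, poor_pred), rmse(observed, good_pred))

print(f"R² (good) = {r2_good:.3f}, R² (poor) = {r2_poor:.3f}")
print(f"RMSE reduction from poor to good model = {rmse_cut:.1%}")
```

A slightly negative R², such as the −0.06 reported for the AOD-only models, therefore means the model performed marginally worse than a constant prediction of the mean concentration.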
Data Availability Statement
The data that support this research will be shared upon reasonable request to the corresponding authors.
License
Copyright (c) 2025 Jada Macharie, Wenge Ni-Meister, Maddalena Romano

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Authors retain the copyright to their work and grant the journal and its publisher (Enviro Mind Solutions) a non-exclusive license to publish and distribute the work freely.