Analyzing the Effects of Data Variability and Quantity on Predicting Particulate Matter (PM2.5) Concentrations: Insights from a Machine Learning Approach

Authors

  • Jada Macharie, Department of Geography and Environmental Science, Hunter College of the City University of New York, New York, NY 10021, USA
  • Wenge Ni-Meister, Department of Geography and Environmental Science, Hunter College of the City University of New York, New York, NY 10021, USA https://orcid.org/0000-0001-9723-2075
  • Maddalena Romano, Department of Geography and Environmental Science, Hunter College of the City University of New York, New York, NY 10021, USA

DOI:

https://doi.org/10.63697/jeshs.2025.10042

Keywords:

PM2.5 concentration, MODIS AOD, Air quality management, Artificial intelligence, Predictive modelling

Abstract

Accurately predicting concentrations of particulate matter 2.5 microns or less in diameter (PM2.5) is imperative to the future of public health and environmental policy. Machine learning models that incorporate spatial and temporal datasets to predict PM2.5 concentrations are often limited by data availability and poor-resolution satellite imagery. In this study, we present multiple predictive models designed for generalized PM2.5 predictions, whose output has been applied to different spatial locations. Using Random Forest (RF) and Extreme Gradient Boosting (XGB) algorithms, these predictive models follow a multidisciplinary approach that combines Moderate Resolution Imaging Spectroradiometer aerosol optical depth (MODIS AOD) with surface datasets (relative humidity, barometric pressure, outdoor temperature, wind speed, and wind direction). The models are trained and validated on historical data to evaluate how the variability and quantity of training data affect the predictive performance of RF and XGB models for PM2.5 concentrations. Using MODIS AOD alone yielded weak predictive performance, with average R² values ranging from -0.06 to 0.07 across the three urban areas (Washington, D.C., Boston, and New York City), highlighting its limited capability as a sole predictor. Integrating the meteorological data with MODIS AOD significantly improved model performance. RF models achieved R² values of 0.30–0.62, while XGB models had R² values of 0.25–0.63, with corresponding RMSE values reduced by 20–30% relative to AOD-only models. Feature importance analysis revealed that PM2.5 predictions were most strongly influenced by temperature (average importance of 0.21), wind speed (0.20), and wind direction (0.15). MODIS AOD exhibited moderate importance (≈0.12), indicating that although satellite-based aerosol observations contributed to the predictions, ground-based meteorological variables remained the primary drivers. These quantitative results show that combining satellite observations with meteorological measurements substantially enhances PM2.5 predictive accuracy, which can inform urban planning, environmental policy, and public health interventions to better protect vulnerable populations.
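
The R² and RMSE figures quoted in the abstract can be read against the following minimal sketch. It assumes the standard definitions, R² = 1 − SS_res/SS_tot and RMSE as the root of the mean squared residual; the abstract does not state which implementation the authors used. The function names and the PM2.5 values are hypothetical, invented only to illustrate why R² can fall below zero when a model predicts worse than the observed mean.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean squared error, in the same units as the observations."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


# Hypothetical PM2.5 values in µg/m³ (illustrative only, not study data).
observed = [8.0, 12.0, 10.0, 14.0, 6.0]
weak_pred = [11.0, 9.0, 10.0, 9.0, 11.0]    # stands in for a poorly fitting model
better_pred = [9.0, 11.0, 10.0, 12.0, 8.0]  # stands in for a better fitting model
mean_pred = [10.0] * 5                      # always predicts the observed mean

# A model that does worse than the mean predictor has a negative R².
weak_r2 = r_squared(observed, weak_pred)
better_r2 = r_squared(observed, better_pred)
baseline_r2 = r_squared(observed, mean_pred)

# Relative RMSE reduction of the better model with respect to the weak one.
rmse_reduction = 1.0 - rmse(observed, better_pred) / rmse(observed, weak_pred)

print(f"weak R2={weak_r2:.2f}, better R2={better_r2:.2f}, baseline R2={baseline_r2:.2f}")
print(f"relative RMSE reduction={rmse_reduction:.1%}")
```

Under these definitions, an R² near zero, such as the AOD-only range reported above, means the model explains little more variance than always predicting the mean concentration.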

Author Biography

  • Wenge Ni-Meister, Department of Geography and Environmental Science, Hunter College of the City University of New York, New York, NY 10021, USA

    Dr. Wenge Ni-Meister is a Professor of Geography and Environmental Science at Hunter College of The City University of New York. She received a B.Sc. and an M.Sc. in Meteorology and Climatology in China, an M.Sc. in Land-Atmosphere Interactions from the University of Connecticut, and a Ph.D. in Remote Sensing Science of Terrestrial Ecosystem from Boston University. She worked as a research scientist at the University of Maryland and NASA Goddard Space Flight Center before joining Hunter College. She has been an investigator for numerous NASA projects, including one on developing a global dynamic terrestrial ecosystem model for coupling with Global Circulation Models (GCMs); one on the fusion of remotely sensed 3D terrestrial ecosystem structure with a dynamic global terrestrial ecosystem model for improved estimates of carbon stocks and land-atmosphere exchanges; and one on the fusion of NASA satellite data with hydrological models for improved land surface soil moisture estimates.

Published

2025-08-30

Data Availability Statement

The data supporting this research will be shared by the corresponding authors upon reasonable request.

Section

Articles

How to Cite

(1)
Macharie, J.; Ni-Meister, W.; Romano, M. Analyzing the Effects of Data Variability and Quantity on Predicting Particulate Matter (PM2.5) Concentrations: Insights from a Machine Learning Approach. J. Environ. Sci. Health Sustain. 2025, 1 (2), 144–159. https://doi.org/10.63697/jeshs.2025.10042.
