Abstract:
This research combines ground-based and satellite data with National
Aeronautics and Space Administration (NASA) POWER (Prediction Of Worldwide Energy
Resources) (NP) data to predict PM2.5 concentration. Unlike earlier studies, this work
uses Moderate Resolution Imaging Spectroradiometer (MODIS) surface reflectance data to
predict PM2.5, which fills a gap in the existing literature. The study mainly focuses on Islamabad
Capital Territory, where rapid development has increased air pollution, with significant
health implications. Ground-based sensors provide local air quality measurements, while satellite
observations provide broader spatial coverage. These three data sources were integrated to train
and evaluate ensemble learning models, specifically Random Forest (RF), Extra Trees (ET) and
Gradient Boosting (GB), with the aim of improving the accuracy of PM2.5 predictions over the study area. The
models used input features including ground-based parameters, wind speed data and
MODIS surface reflectance for three spectral bands. The datasets were augmented with polynomial
features and divided into training and testing sets with an 80/20 split. Five-fold cross-validation was
used to assess model robustness, after which the models were tested on the held-out 20% of the data. In addition, both
spatial and temporal predictions were conducted to assess the models' performance. The results on
the 80/20 split show that GB (RMSE = 14.87, MAE = 9.10 and R² = 0.73) performs
best, followed by ET. Detailed results of spatial and temporal prediction for different
locations and timeframes are discussed in this study. Furthermore, this study demonstrates that
machine learning (ML) can be integrated effectively into air pollution monitoring and prediction.
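
The evaluation workflow summarized above (polynomial feature expansion, an 80/20 split, 5-fold cross-validation on the training portion, then RMSE, MAE and R² on the held-out 20%) can be sketched with scikit-learn as follows. This is a minimal illustration only, not the study's actual code: the data are synthetic, the five feature columns are hypothetical stand-ins for the ground-based, wind speed and MODIS reflectance inputs, and the polynomial degree, number of trees and random seeds are all assumptions.

```python
import numpy as np
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): 5 hypothetical feature columns,
# e.g. a ground-based parameter, wind speed, and 3 reflectance bands.
rng = np.random.default_rng(0)
n_samples = 300
X = rng.random((n_samples, 5))
y = (50 + 30 * X[:, 0] - 20 * X[:, 1] + 10 * X[:, 2] * X[:, 3]
     + rng.normal(0, 2, n_samples))  # synthetic "PM2.5" target

# 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The three ensemble regressors named in the abstract.
models = {
    "RF": RandomForestRegressor(n_estimators=50, random_state=0),
    "ET": ExtraTreesRegressor(n_estimators=50, random_state=0),
    "GB": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, estimator in models.items():
    # Degree-2 polynomial expansion is an assumed setting.
    pipe = make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False), estimator
    )
    # 5-fold cross-validation on the training portion only.
    cv_r2 = cross_val_score(pipe, X_train, y_train, cv=5, scoring="r2")
    # Refit on the full training set, then score on the held-out 20%.
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results[name] = {
        "cv_r2_mean": float(cv_r2.mean()),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
        "r2": float(r2_score(y_test, pred)),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Wrapping the polynomial expansion and the regressor in one pipeline keeps the feature transformation inside each cross-validation fold, so the folds are scored the same way as the final test set.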