Ecology and Environmental Sciences, 2026, Vol. 35, Issue (6): 976-985. DOI: 10.16258/j.cnki.1674-5906.2026.06.014

• Research Article [Environmental Science]

Machine Learning Supported Determination for the Main Controlling Factors of Heavy Metal Pollution in the Ganjiang River

ZHOU Jiahao1,2, ZHANG Pei1,2, PENG Yuwen1,2, ZHONG Songxiong3, ZOU Jianping1,2, HOU Dongmei1,2,*

    1 School of Environmental and Chemical Engineering, Nanchang Hangkong University, Nanchang 330063, P. R. China
    2 National-Local Joint Engineering Research Center of Heavy Metals Pollutants Control and Resource Utilization, Nanchang Hangkong University, Nanchang 330063, P. R. China
    3 Institute of Eco-Environmental and Soil Science, Guangdong Academy of Sciences, Guangzhou 510650, P. R. China
  • Received: 2025-10-31 Revised: 2026-03-29 Accepted: 2026-05-12 Online: 2026-06-18 Published: 2026-06-08

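Title (translated from Chinese): Identifying the Main Controlling Factors of Heavy Metal Pollution in the Ganjiang River Basin Based on Machine Learning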

ZHOU Jiahao (周家豪)1,2, ZHANG Pei (张沛)1,2, PENG Yuwen (彭煜文)1,2, ZHONG Songxiong (钟松雄)3, ZOU Jianping (邹建平)1,2, HOU Dongmei (侯冬梅)1,2,*

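    1 School of Environmental and Chemical Engineering, Nanchang Hangkong University, Nanchang, Jiangxi 330063
    2 National-Local Joint Engineering Research Center of Heavy Metals Pollutants Control and Resource Utilization, Nanchang Hangkong University, Nanchang, Jiangxi 330063
    3 Institute of Eco-Environmental and Soil Science, Guangdong Academy of Sciences, Guangzhou, Guangdong 510650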
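  • Corresponding author: * HOU Dongmei, E-mail: hou_dong_mei@126.com
  • About the author: ZHOU Jiahao (born 2003), male, master's student, mainly engaged in research on the mechanisms of heavy metal migration and transformation in river basins. E-mail: jiahaozz03@163.com
  • Funding:
    National Key Research and Development Program of China (2022YFD1700802); General Program of the Natural Science Foundation of Jiangxi Province (20252BAC240331); Open Fund of the Key Laboratory of Ion-type Rare Earth Resources and Environment, Ministry of Natural Resources (2024IRERE401)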

Abstract:

The Ganjiang River is the largest river system in the Poyang Lake basin and plays a vital role in regional ecological balance and economic development. However, mining activities have posed significant challenges to water quality and the ecological environment in this area, and heavy metal contamination has emerged as a particularly acute issue that demands urgent resolution. Although the detrimental impacts of heavy metal pollution on ecosystems are well documented, a comprehensive evaluation covering multiple heavy metals across the Ganjiang River is still lacking. Hence, accurate prediction of heavy metal concentrations and identification of their key controlling factors are essential for pollution management. In this study, 1000 sets of heavy metal concentration and environmental factor data were systematically collected through literature review and laboratory analysis. Eight environmental factors and the concentrations of fifteen metal elements were selected as model input variables: pH, oxidation-reduction potential (ORP), dissolved oxygen (DO), total organic carbon (TOC), electrical conductivity (EC), total nitrogen (TN), total phosphorus (TP), and potassium ions (K⁺), together with metal elements such as Al, Cr, Co, Ni, Mn, Fe, Cu, and Zn. The predictive performance for Cd, As, and Pb concentrations was compared across five machine learning models: linear regression (LR), decision tree (DT), random forest (RF), extreme gradient boosting (XGBoost), and support vector machine (SVM). The data were randomly divided into a training set (80%) and a test set (20%). All models were tuned using 5-fold cross-validation to obtain the optimal hyperparameters. The coefficient of determination (R²), mean absolute error (σMAE), and root mean square error (σRMSE) were used to evaluate regression performance.
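The data split and evaluation metrics described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which software was used, and the function names, the random seed, and the toy values below are hypothetical.

```python
import math
import random

def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and split them into train and test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seed is an arbitrary, assumed value
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]

def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# 1000 samples split 80/20; the training part is then cut into 5 folds.
train_idx, test_idx = train_test_split_indices(1000)
folds = list(k_fold_indices(train_idx, k=5))

# Toy measured and predicted values (hypothetical, not from the study).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(len(train_idx), len(test_idx), [len(val) for _, val in folds])
print(r2_score(y_true, y_pred), mae(y_true, y_pred), rmse(y_true, y_pred))
```

In such a workflow each candidate hyperparameter setting would be scored on the five validation folds, and the held-out 20% would be used only once, to report the final R², σMAE, and σRMSE.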
For the best-performing model, SHapley Additive exPlanations (SHAP) and permutation feature importance were used to rank the input variables and identify the factors most strongly related to heavy metal concentration. The results showed the following.
1) The concentrations of heavy metal ions and the environmental factors in the Ganjiang River varied substantially. The pH ranged from 1.95 to 9.64, with a mean of 5.91, slightly lower than that of natural waters, which might be due to the discharge of acid mine drainage in mining areas. Particular attention should also be paid to TN and TP, whose concentrations at some sampling sites reached as high as 7.918 mg·L⁻¹ and 0.98 mg·L⁻¹, respectively. Similarly, the contents of Cd, As, Pb, Cr, Fe, Mn, and Hg were significantly higher than the Class Ⅳ limits for surface water (GB 3838-2002). Moreover, the coefficients of variation (CV) of Cd, As, and Pb were 130%, 78%, and 111%, respectively. These results indicate that heavy metal contents in the Ganjiang River are strongly influenced by anthropogenic activities and pose potential ecological risks.
2) Linear and nonlinear models differed substantially in their predictive performance for heavy metal concentrations. The linear regression model could not capture the complex nonlinear relationships between the indicators and heavy metal concentrations, and significant discrepancies were observed between the predicted and measured values for some samples. Nonlinear models are better at extracting information from the data and at identifying and modeling nonlinear relationships. For predicting Cd concentrations, RF and SVM performed similarly, with SVM showing slightly higher accuracy (R², 0.981; σRMSE, 0.163; σMAE, 0.0983), followed by XGBoost and DT, while the LR model exhibited the lowest accuracy.
Similarly, for Pb concentration prediction, SVM achieved the highest accuracy, with R², σRMSE, and σMAE values of 0.971, 0.301, and 0.155, respectively. SVM is a machine learning method well suited to small samples and nonlinear regression problems. It can effectively handle classification and regression tasks with high-dimensional features by relying on support vectors instead of the entire dataset, which allows strong performance even with limited samples. By contrast, RF achieved the highest accuracy for predicting As concentrations, with R², σRMSE, and σMAE values of 0.963, 0.244, and 0.126, respectively. RF avoids the overfitting risk associated with a single decision tree and enhances model robustness through random selection of features and samples during training. Taking R², σRMSE, and σMAE together, SVM was identified as the optimal prediction model for Cd and Pb concentrations, while RF was optimal for As. LR performed comparatively poorly for all three heavy metals.
3) The variable importance assessment for the best-performing models showed that TOC, pH, and metal content significantly affect heavy metal concentrations in the Ganjiang River basin. TOC was identified as the primary controlling factor for Cd, with which it exhibited a significant positive correlation. In contrast, Mg was identified as the principal factor influencing As, despite the absence of a statistically significant correlation between them. The concentration of Pb was predominantly regulated by pH and was positively correlated with it.
4) In the future, with advances in computing, multi-model integration frameworks are gradually coming into the spotlight. Techniques such as model stacking and blending can leverage the strengths of different machine learning models to produce more powerful predictive models.
Additionally, combining physics-based models with data-driven approaches can improve the interpretability and predictive capability of the models while lowering the barrier to their use. Integrating ecological assessments with pollution studies will provide a more holistic understanding of how heavy metal contamination affects the long-term sustainability of the Ganjiang River basin. Identifying effective mitigation strategies and regulations to reduce heavy metal pollution in the basin is crucial. Future research should take multidisciplinary approaches and involve local communities in conservation efforts, raising awareness of the importance of the Ganjiang River basin. Collaboration among scientists, policymakers, and local stakeholders is key to successful conservation initiatives.
5) In conclusion, this study offers valuable insights into heavy metal contamination, although its limitations should be recognized. It develops a machine learning framework for determining heavy metal concentrations and assesses the performance of five machine learning algorithms in predicting them. The findings indicate that machine learning models, notably SVM and RF, achieve high accuracy and robustness in predicting heavy metal concentrations. This study offers a novel approach and perspective for future determination of heavy metal concentrations.
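The permutation feature importance analysis mentioned in the abstract can be illustrated with a short, model-agnostic sketch. It assumes importance is measured as the mean increase in mean squared error after shuffling one input column; the abstract does not state the scoring metric the authors used. The data are synthetic, and `predict` is a hypothetical stand-in for a fitted model.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between measured and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mse(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Synthetic data (not from the study): y depends strongly on x0,
# weakly on x1, and not at all on x2.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random(), data_rng.random()] for _ in range(200)]
y = [3.0 * x0 + 0.5 * x1 for x0, x1, _ in X]

def predict(row):
    # Hypothetical stand-in for a fitted model such as RF or SVM.
    return 3.0 * row[0] + 0.5 * row[1]

scores = permutation_importance(predict, X, y)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores)
print(ranking)
```

A feature the model ignores leaves the error unchanged when shuffled, so its score is zero, while shuffling an influential feature raises the error sharply. The ranking says which inputs the model relies on, not the direction of their effect, which is why the abstract pairs importance with a separate correlation analysis.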

Key words: Ganjiang River Basin, heavy metal ions, machine learning, model comparison, concentration prediction, dominant factors

Abstract (translated from Chinese):

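Mining has created serious challenges for water quality and the ecological environment in the Ganjiang River basin, among which heavy metal pollution is particularly prominent and urgently needs to be addressed. Accurately predicting heavy metal mass concentrations in the basin and identifying their main controlling factors are key to pollution control. In this study, 1000 sets of environmental factor and heavy metal mass concentration data from the Ganjiang River basin were systematically collected. Five machine learning algorithms, namely linear regression (LR), decision tree (DT), random forest (RF), extreme gradient boosting (XGBoost), and support vector machine (SVM), were used to predict heavy metal mass concentrations in the basin, and Shapley additive explanations analysis was used to identify the main controlling factors of pollution. The results showed that 1) the accuracy of heavy metal mass concentration predictions differed considerably among models. The RF model performed best for As, with a test-set R² of 0.963 and σRMSE and σMAE of 0.244 and 0.126, respectively. The SVM model performed best for Cd and Pb: for Cd, the test-set R² was 0.981, with σRMSE and σMAE of 0.163 and 0.0983, respectively; for Pb, the test-set R² was 0.971, with σRMSE and σMAE of 0.301 and 0.155, respectively. 2) Feature importance and correlation analyses indicated that TOC was the main controlling factor of Cd mass concentration and was positively correlated with Cd (p<0.001); As mass concentration was mainly affected by Mg, although the two were not significantly correlated (p>0.05); Pb was mainly affected by pH, with which it was positively correlated (p<0.001). By identifying the main controlling factors of typical heavy metal pollution in the Ganjiang River basin through machine learning, this study provides a scientific reference for environmental monitoring, pollution control, and ecologically sustainable development in the region.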
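The correlation analysis with p-values can be illustrated as follows. The abstract does not name the correlation coefficient or the significance test, so this sketch assumes a Pearson coefficient with a two-sided permutation test; the function names, seeds, and synthetic data are hypothetical.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(y_perm)  # destroy any pairing between x and y
        if abs(pearson_r(x, y_perm)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)

# Synthetic data (not from the study): one variable tied to x, one unrelated.
data_rng = random.Random(7)
x = [data_rng.random() for _ in range(100)]
y_related = [2.0 * v + data_rng.gauss(0.0, 0.2) for v in x]
y_unrelated = [data_rng.random() for _ in range(100)]

r_related = pearson_r(x, y_related)
p_related = permutation_p_value(x, y_related)
r_unrelated = pearson_r(x, y_unrelated)
p_unrelated = permutation_p_value(x, y_unrelated)
print(r_related, p_related)
print(r_unrelated, p_unrelated)
```

A factor can rank first in feature importance yet show no significant linear correlation, as reported for Mg and As, because nonlinear models can exploit relationships that a single correlation coefficient does not capture.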

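Key words (translated from Chinese): Ganjiang River basin, heavy metals, machine learning, model comparison, mass concentration prediction, main controlling factors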
