Optimization of machine learning models for assessing the risk of tuberculosis spread
Dmytro Hospodarchuk, Denys Nevinskyi, Dmytro Martjanov, Yaroslav Vyklyuk, Ihor Semianiv
Management of Development of Complex Systems · 2025-03
Abstract
Tuberculosis (TB) remains one of the most pressing public health issues, especially in developing countries. The high incidence rate and the spread of multidrug-resistant strains of Mycobacterium tuberculosis pose significant challenges to modern medicine. India is among the countries with the highest TB burden, so improving methods for predicting the spread of the disease is crucial for implementing prevention and treatment measures effectively. Machine learning (ML) methods make it possible to automate large-scale data analysis and to identify key risk factors. This study aims to develop effective ML models for assessing the risk of TB spread in India based on socio-economic, demographic, and medical factors. A dataset of 148 records covering the period 2019–2022, broken down by Indian state, was used for the analysis. Key variables included the number of detected TB cases, treatment success rates, mortality rates among patients, and patients' tobacco and alcohol consumption status. The study involved data preprocessing, correlation analysis, and the application of ML methods. Seven models were tested: linear regression, two regularized linear models (Lasso and Ridge), support vector machine (SVM), k-nearest neighbors (KNN), random forest, and decision tree. The SVM model with optimized parameters performed best, achieving the highest coefficient of determination and the lowest root mean square error. Comparison with the other models showed significant advantages of SVM over linear regression and the decision tree, both of which generalized poorly. The most influential factors in predicting TB spread were identified using the Permutation Importance method.
The most significant factors included geographic location (state), the number of registered TB cases among children, the number of women with TB, the mortality rate among patients, and the infrastructure available for treating drug-resistant TB. Social factors, such as tobacco and alcohol consumption among patients, were also found to influence disease spread, although their contribution is smaller. The study confirmed the effectiveness of ML methods for predicting the spread of tuberculosis. The optimized SVM model provided the best accuracy and generalization capability. Factor importance analysis showed that regional characteristics, demographic indicators, and mortality rates have the greatest impact on disease spread. The results can be used to improve TB control strategies, particularly through targeted interventions in high-risk regions. The use of ML methods makes disease control more efficient, which is an essential step in the global fight against tuberculosis.
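The workflow described in the abstract (comparing several regressors by coefficient of determination and root mean square error, tuning the SVM, then ranking features with permutation importance) can be illustrated with a minimal sketch. It is not the authors' code. It assumes scikit-learn, treats the task as regression (so the SVM is taken to be support vector regression), and uses synthetic data in place of the Indian state-level dataset. Only the record count of 148 comes from the abstract; the feature count, hyperparameter grid, and train/test split are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the TB dataset (assumption): 148 rows as in the
# abstract, 8 made-up numeric features, one continuous target.
X, y = make_regression(n_samples=148, n_features=8, n_informative=4,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The model families named in the abstract, with default-like settings
# (assumed, not taken from the paper).
models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=1.0),
    "ridge": Ridge(alpha=1.0),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "decision_tree": DecisionTreeRegressor(random_state=0),
}

# SVR tuned over a small illustrative grid; features are scaled inside the
# pipeline so that scaling is refit on each cross-validation fold.
svr_search = GridSearchCV(
    make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    param_grid={"svr__C": [1.0, 10.0, 100.0], "svr__epsilon": [0.1, 1.0]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
fitted = {"svm_tuned": svr_search.fit(X_train, y_train)}
for name, model in models.items():
    fitted[name] = make_pipeline(StandardScaler(), model).fit(X_train, y_train)

# Score every model on the held-out split with R^2 and RMSE.
results = {}
for name, estimator in fitted.items():
    pred = estimator.predict(X_test)
    results[name] = {
        "r2": float(r2_score(y_test, pred)),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
    }
for name, m in sorted(results.items(), key=lambda kv: kv[1]["rmse"]):
    print(f"{name:14s} R2={m['r2']:.3f} RMSE={m['rmse']:.2f}")

# Permutation importance: shuffle one feature at a time on the test split
# and record the resulting drop in R^2.
perm = permutation_importance(fitted["svm_tuned"], X_test, y_test,
                              scoring="r2", n_repeats=20, random_state=0)
ranking = np.argsort(perm.importances_mean)[::-1]
for idx in ranking:
    print(f"feature_{idx}: {perm.importances_mean[idx]:.3f} "
          f"+/- {perm.importances_std[idx]:.3f}")
```

On real data, a categorical column such as state would need encoding before scaling, and with only 148 records a repeated cross-validation estimate would likely be more stable than a single train/test split.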
MeSH terms
- Tuberculosis
- Computer science
- Machine learning
- Artificial intelligence