Predicting multi-drug resistant tuberculosis using machine learning on genomic and clinical data
Saxena K, Mary SSC, Ghosh P, Mohan CR, Naved M, Madanan M
The Indian journal of tuberculosis · 2025-11
Abstract
Background: Tuberculosis (TB) remains one of the leading causes of death worldwide, especially in low- and middle-income countries. Multi-drug resistant tuberculosis (MDR-TB), defined by resistance to both isoniazid and rifampicin, is a form of tuberculosis that makes treatment and disease control much harder. Conventional diagnostic methods, such as quantitative drug resistance tests and genetic tools, are time- and resource-intensive and are not always available in low-resource settings. Advances in whole-genome sequencing and the growing availability of clinical data make it possible to use machine learning to detect MDR-TB quickly and accurately.

Methods: This study uses genomic and clinical data from a freely available set of about 5000 TB patient samples, which includes drug-susceptible, MDR, and extensively drug-resistant (XDR) cases. Feature selection and normalization are applied to prepare the whole-genome sequencing data and clinical factors. Logistic Regression, Random Forest, Support Vector Machine, Gradient Boosting Machine, and Deep Neural Network models are built and evaluated using stratified cross-validation. Model performance is measured using accuracy, precision, recall, F1-score, and AUC-ROC.

Results: The Gradient Boosting and Deep Neural Network models achieved the highest accuracy (92.3 % and 93.1 %, respectively) and the best AUC-ROC scores (94.7 % and 95.4 %, respectively). These models were also better at identifying MDR-TB cases. When genomic data were combined with clinical records, the models were more robust and accurate than when genomic data were used alone. Feature importance analyses identified key genetic changes and clinical factors that influence drug resistance.

Conclusion: Using genomic and clinical data in machine learning algorithms to detect MDR-TB is promising, fast, and accurate, especially in low-resource settings. Future research should incorporate more data types, simplify models, and expand datasets to improve diagnostic accuracy and practicality.
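The abstract does not give implementation details, so the following is only a minimal sketch of the evaluation protocol it names: stratified fold assignment and the reported metrics (accuracy, precision, recall, F1-score, AUC-ROC) for a binary MDR versus non-MDR label. The function names (`stratified_folds`, `binary_metrics`, `auc_roc`) are hypothetical, the code uses only the Python standard library, and it is not the authors' pipeline.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a similar share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels coded 1 (MDR) / 0 (other)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc_roc(y_true, scores):
    """AUC as the chance a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a cross-validation loop under these assumptions, each fold from `stratified_folds` would serve once as the held-out test set, with `binary_metrics` and `auc_roc` computed on that fold's predictions and then averaged across folds. Stratification matters here because resistant cases are typically a minority class, and an unstratified split could leave some folds with too few of them to estimate recall reliably.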
MeSH terms
- Humans
- Mycobacterium tuberculosis
- Tuberculosis, Multidrug-Resistant
- Antitubercular Agents
- Genomics
- Machine Learning
- Support Vector Machine
- Whole Genome Sequencing
- Neural Networks, Computer