Development of a machine learning model for early pulmonary tuberculosis diagnosis using blood test biomarkers
Liangqiong Chen, Cui Yang, Yefeng Dong, Ruihan Ge, Jun Xu, Rongman Xu, Haitao Zhang, Deping Dong, et al. (12 authors)
BMC Infectious Diseases · 2025-11
Abstract
Tuberculosis (TB) is a major global health threat, causing 10.6 million new cases and 1.3 million deaths in 2022. Early diagnosis is crucial, but current methods are often costly and slow for resource-limited settings. This study aimed to develop a rapid, low-cost diagnostic tool using routine blood indicators.

We retrospectively analyzed data from 728 TB patients and 2,718 healthy controls. The training set was balanced using the ROSE technique. We trained seven machine learning models, using LASSO regression and forward selection to identify optimal features. The final model was interpreted with SHAP and deployed as an interactive Shiny application.

The Gradient Boosting Machine (GBM) model performed optimally on the test set (AUC = 0.831, specificity = 0.855, sensitivity = 0.644). SHAP analysis identified platelet-to-lymphocyte ratio (PLR), monocyte-to-lymphocyte ratio (MLR), and platelet distribution width (PDW) as key predictors. Lowering the classification threshold to 0.24 increased sensitivity to 83.6% (specificity 59.9%), demonstrating its screening potential. An interactive web application was developed to enhance clinical utility.

This study delivers a validated GBM model using routine blood tests as a cost-effective TB screening tool. Its high specificity can reduce unnecessary follow-up tests. The model’s core predictors are interpretable and provide biological insights into the inflammatory response in TB. The accompanying Shiny app increases accessibility, making it a promising tool for resource-limited settings. We propose a phased diagnostic strategy: use this model with a low threshold (0.24) for high-sensitivity initial screening, followed by confirmatory molecular testing for positive cases. This approach balances high detection rates with resource optimization. Future work should include prospective validation with cohorts including other respiratory diseases.
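Two of the key predictors named above, PLR and MLR, are simple ratios of counts from a routine complete blood count. The following is a minimal sketch of how they could be derived, assuming all counts are reported in the same unit (10^9 cells/L). The `BloodCount` class, the `derived_ratios` function, and the example values are hypothetical and are not taken from the study's code or data.

```python
from dataclasses import dataclass


@dataclass
class BloodCount:
    """Hypothetical container for routine blood counts (all in 10^9 cells/L)."""
    platelets: float
    lymphocytes: float
    monocytes: float


def derived_ratios(cbc: BloodCount) -> dict:
    """Return PLR and MLR, the two ratio predictors named in the abstract."""
    if cbc.lymphocytes <= 0:
        # Both ratios divide by the lymphocyte count, so it must be positive.
        raise ValueError("lymphocyte count must be positive")
    return {
        "PLR": cbc.platelets / cbc.lymphocytes,  # platelet-to-lymphocyte ratio
        "MLR": cbc.monocytes / cbc.lymphocytes,  # monocyte-to-lymphocyte ratio
    }


# Illustrative values only, not patient data from the study.
example = derived_ratios(BloodCount(platelets=250.0, lymphocytes=2.0, monocytes=0.5))
```

PDW, the third predictor, is typically reported directly by the hematology analyzer and would need no derivation in such a pipeline.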
The GBM model constructed in this study, based on routine blood indicators, demonstrated high specificity (85.5%) and favorable diagnostic performance (AUC = 0.831) in the test set. SHAP analysis identified PLR, MLR, and PDW as key predictors, reflecting the inflammatory-immune characteristics of TB. The freely accessible Shiny tool is suitable for primary screening and facilitates phased screening strategies in high-burden settings.
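The trade-off behind the phased strategy (a lower classification threshold raises sensitivity at the cost of specificity) can be illustrated with a small sketch. The scores and labels below are toy values invented for illustration, not study data, and the 0.5 comparison cut-off is an assumption, since the abstract does not state the default threshold. The function name `sensitivity_specificity` is likewise hypothetical.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Compute (sensitivity, specificity) when scores >= threshold are called positive."""
    tp = fn = tn = fp = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: 1 = TB, 0 = healthy control; scores are hypothetical model outputs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.2, 0.7, 0.4, 0.3, 0.1, 0.1, 0.05]

# Assumed conventional cut-off versus the screening cut-off from the abstract.
sens_default, spec_default = sensitivity_specificity(y_true, scores, 0.5)
sens_screen, spec_screen = sensitivity_specificity(y_true, scores, 0.24)
```

On this toy data, moving the threshold from 0.5 to 0.24 flags more true cases and more false positives, which is the pattern the abstract reports and the reason positive screens are followed by confirmatory molecular testing.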
MeSH terms
- Machine learning
- Artificial intelligence
- Medicine
- Test set
- Lasso (statistics)
- Tuberculosis
- Boosting (machine learning)
- Tropical medicine
- Blood test
- Gradient boosting
- Medical microbiology
- Training set
- Regression