Clinical data analysis research on tuberculosis based on machine learning
Kang R, Liu H, Lei Q, Li T
Frontiers in Medicine · 2026-04
Abstract
Background: Tuberculosis (TB) remains a global health challenge, with heterogeneous treatment outcomes despite standardized protocols. Traditional statistical models struggle with high-dimensional clinical data, necessitating advanced machine learning (ML) approaches.
Objective: To analyze clinical data from 467 pulmonary TB patients and construct a predictive model using multiple ML algorithms.
Methods: A prospective cohort of 467 patients (218 intervention, 249 control) was enrolled from Xi'an Chest Hospital. Medical ratio features (ALT/AST, CD4/CD8) and polynomial interaction terms (e.g., RBC × ALT) were constructed. Recursive feature elimination (RFE) selected 60 predictive factors from an expanded 80-dimensional feature space. Fourteen ML algorithms were systematically compared, with hyperparameters optimized via grid search. Performance was assessed using five-fold cross-validation R², RMSE, and MAE.
Results: LightGBM achieved the highest initial predictive performance (R² = 0.1829, RMSE = 139.23). Following hyperparameter optimization, Random Forest attained a marginally improved R² of 0.1867 with comparable error metrics and enhanced clinical interpretability, serving as the final reference model. Feature engineering expanded the feature space from 33 to 80, with 60 optimal features retained.
Conclusion: The optimized Random Forest model (R² = 0.1867) demonstrates moderate accuracy and clinical interpretability, supporting its potential as a decision-support tool for TB treatment optimization. Pharmacist-led therapeutic drug monitoring (TDM) further enhances individualized therapy. Future work requires multi-center validation and radiomics integration to improve predictive performance in severe cases.
Clinical trial registration: Chinese Clinical Trial Registry [https://www.chictr.org.cn/], identifier [ChiCTR2300074328].
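The ratio and interaction features and the cross-validation metrics named in the Methods can be illustrated with a minimal, standard-library-only sketch. This is not the authors' code: the function names, the record keys (`ALT`, `AST`, `CD4`, `CD8`, `RBC`), the NaN policy for zero denominators, and the contiguous (unshuffled) fold layout are all assumptions made for illustration.

```python
import math


def add_engineered_features(record):
    """Return a copy of one patient record with ratio and interaction features.

    Key names are hypothetical; the abstract does not give column names.
    """
    out = dict(record)
    # Ratio features. A zero denominator yields NaN instead of raising
    # (an assumed policy, not taken from the paper).
    out["ALT_AST_ratio"] = record["ALT"] / record["AST"] if record["AST"] else math.nan
    out["CD4_CD8_ratio"] = record["CD4"] / record["CD8"] if record["CD8"] else math.nan
    # Pairwise interaction term, as in the RBC x ALT example.
    out["RBC_x_ALT"] = record["RBC"] * record["ALT"]
    return out


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous test folds of near-equal size.

    Real pipelines usually shuffle first; that step is omitted here.
    """
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def regression_metrics(y_true, y_pred):
    """Compute R^2, RMSE, and MAE (undefined if y_true is constant)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }
```

With 467 patients, `k_fold_indices(467)` produces test folds of 94, 94, 93, 93, and 93 records. Because R² is defined as 1 − SS_res/SS_tot, a predictor that always returns the mean of the evaluated targets scores exactly 0, which gives a reference point for reading the reported values.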