TB Research

Machine Learning Models for Predicting Latent Tuberculosis Infection Risk in Close Contacts of Patients with Pulmonary Tuberculosis — Henan Province, China, 2024

Sun Dingyong, Wu Xuan, Zhang Yanqiu, Wang Weidong, He Mengya, Diao Linqi

China CDC Weekly · 2026-01

Abstract

Introduction: We explored risk factors for latent tuberculosis infection (LTBI) and developed a risk prediction model using machine learning algorithms.

Methods: Patients with active pulmonary TB who were in months 3 to 6 of anti-TB treatment in Henan Province, China, from July to September 2024 were selected as index cases. Close contacts identified through epidemiological investigation underwent tuberculin purified protein derivative (PPD) testing to determine LTBI status, and epidemiological data were collected through face-to-face questionnaires. The dataset was divided into training and testing sets at a 6:4 ratio using a fixed random seed. Five models were trained: logistic regression (LR), decision tree (DT), random forest (RF), support vector machine (SVM), and multilayer perceptron (MLP). The models were evaluated using the mean squared error (MSE) and the coefficient of determination. The test set was used for external validation. Receiver operating characteristic curve analysis, area under the curve (AUC), and F1-scores were used to quantify predictive performance.

Results: Among 795 close contacts, 401 (50.5%) had LTBI. Ranked by MSE from lowest to highest, the models were SVM (0.121), RF (0.165), DT (0.197), LR (0.229), and MLP (0.233). The SVM model identified five key predictors: contact type of index case, key population classification, residential area, frequency of participation in group activities, and etiological results. Internal validation showed strong performance (AUC=0.921, F1=0.858), whereas external validation showed moderate performance (AUC=0.752, F1=0.694).

Conclusion: The SVM model incorporating contact type of index case, key population classification, residential area, frequency of participation in group activities, and etiological results demonstrated predictive value for LTBI risk, although performance was lower in external validation than in internal validation. This model shows promise for the targeted screening and management of high-risk populations.
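
For readers who want to see how the split and the evaluation metrics named in the Methods fit together, the following is a minimal plain-Python sketch, not the authors' implementation. The abstract does not state the software, the seed value, or the classification threshold, so the seed (42), the 0.5 threshold, the function names, and the toy labels and scores are all assumptions made for illustration. MSE is computed here on predicted probabilities against 0/1 outcomes, which is one plausible reading of how MSE was applied to a binary outcome.

```python
import random

def split_6_4(records, seed=42):
    """Shuffle with a fixed seed and split 60% / 40% (seed value is an assumption)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 6 // 10
    return shuffled[:cut], shuffled[cut:]

def mse(y_true, y_prob):
    """Mean squared error between 0/1 outcomes and predicted probabilities."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def auc(y_true, y_prob):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def f1(y_true, y_prob, threshold=0.5):
    """F1-score after thresholding probabilities (threshold is an assumption)."""
    pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, q in zip(y_true, pred) if y == 1 and q == 1)
    fp = sum(1 for y, q in zip(y_true, pred) if y == 0 and q == 1)
    fn = sum(1 for y, q in zip(y_true, pred) if y == 1 and q == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical stand-in for the 795 close contacts (IDs only, no real data).
train_ids, test_ids = split_6_4(range(795))
print(len(train_ids), len(test_ids))

# Toy labels and model scores, invented purely to exercise the metrics.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
print(mse(toy_labels, toy_scores), auc(toy_labels, toy_scores), f1(toy_labels, toy_scores))
```

Because the split uses a seeded generator, rerunning it reproduces the same partition, which is what a fixed random seed provides. In practice each of the five models would supply its own predicted probabilities in place of the toy scores.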

MeSH terms

  • Medicine
  • Machine learning
  • Logistic regression
  • Support vector machine
  • Random forest
  • Artificial intelligence
  • Multilayer perceptron
  • Receiver operating characteristic
  • Population
  • Latent tuberculosis
  • Test set
  • Tuberculosis
  • Decision tree
  • Pulmonary tuberculosis
  • Statistics
  • Predictive modelling
  • Etiology
  • Epidemiology
  • Risk assessment
  • Index (typography)
  • Latent class model