TB Research

Integrative genomic sequencing and machine learning approaches for tuberculosis drug resistance, diagnostic tool development, and transmission analysis

L. L. Wang

LSHTM Research Online (London School of Hygiene and Tropical Medicine) · 2025-01

Abstract

Tuberculosis, caused by bacteria in the Mycobacterium tuberculosis complex, is a major global public health issue. The emergence of drug-resistant tuberculosis strains presents an alarming challenge to disease control, casting a shadow over the ambitious elimination goals set by the World Health Organization. Researchers are increasingly turning to advanced techniques such as high-throughput sequencing to gain deeper insights into the biology of Mycobacterium tuberculosis. The resulting omics data can inform the design of diagnostics, therapies, and vaccines. The Mycobacterium tuberculosis genome (4.4 Mbp) is characterised by a high GC content (65%) and encompasses 4,111 genes. Whole genome sequencing can also reveal the genetic mutations underlying the escalation of drug-resistant tuberculosis, leading to personalised medicine applications. High tuberculosis burden countries, such as the Philippines, are now adopting whole genome sequencing technologies to gain insights into the genetic makeup of circulating Mycobacterium tuberculosis strains, revealing transmission clusters and the presence of genotypic drug resistance, thereby informing infection control and surveillance decision-making. A national-level genomic study in the Philippines uncovered high transmission rates, inadequate management of resistant cases, and the first reported instance of bedaquiline resistance in the region. To improve resistance diagnostics, a Gaussian mixture model-based method was developed to detect mixed infections and assign resistance mutations to individual strains using large-scale public isolate data. Recognising the cost barriers of whole genome sequencing, a flexible amplicon design tool was created to enable targeted sequencing guided by mutation frequency.
For drug resistance prediction, deep learning models were applied to predict minimum inhibitory concentrations, capturing subtle resistance patterns beyond binary classification. A clinical decision support model using XGBoost was also developed to predict treatment outcomes from patient records. The model was optimised for real-world constraints: it handles missing data and is designed for resource-limited settings. In exploring therapeutic innovation, a recurrent neural network was used to classify and generate antimicrobial peptides tailored to tuberculosis. Separately, a graph neural network was developed to identify positively selected mutations from phylogenetic trees, offering a scalable tool for evolutionary surveillance. Together, this work integrates statistical modelling, software development, and artificial intelligence to deliver practical tools and insights for tuberculosis diagnostics, treatment, and long-term control.

MeSH terms

  • Mycobacterium tuberculosis
  • Tuberculosis
  • Transmission (medicine)
  • Genome
  • Computational biology
  • Drug resistance
  • Genomics
  • Whole genome sequencing
  • DNA sequencing
  • Biology
  • Precision medicine
  • Public health
  • Drug discovery
  • Personalized medicine
  • Genetics
  • Mutation
  • Mycobacterium tuberculosis complex
  • Personal genomics