TB Research

Peering into the dark matter of the M. tuberculosis genome using long-read sequencing

Marin, Maximillian Gabriel

Digital Access to Scholarship at Harvard (DASH) (Harvard University) · 2023-01

Abstract

Tuberculosis (TB) is an infectious disease responsible for over 1 million deaths per year. TB primarily manifests as a pulmonary infection but can also disseminate throughout the body. The causative agents of TB are bacteria of the Mycobacterium tuberculosis complex (MTBC). Understanding the evolution of the MTBC is critical for developing more effective TB vaccines, combating antibiotic resistance, and identifying the factors that have enabled the MTBC to become such a successful pathogen. Short-read whole-genome sequencing (SR-WGS) has greatly advanced our understanding of the genetic diversity of the MTBC. However, due to the limitations of short-read sequencing, a significant portion (~10%) of the MTBC genome has been systematically excluded from analysis. In this work, we use a new type of technology, long-read sequencing, to confidently study the ~10% of the genome that previous studies of genetic diversity have excluded. We generate 158 high-quality complete genome assemblies spanning the major lineages of human-adapted MTBC. These assemblies allow us to uncover new aspects of MTBC evolution and to benchmark common analysis approaches in microbial genomics.
In chapter 1, we use 36 complete assemblies to systematically evaluate the accuracy of SR-WGS for variant calling in MTBC isolates. These benchmarking results have broad implications for the use of SR-WGS in the study of MTBC biology, the inference of transmission in public health surveillance systems, and WGS applications in other organisms. In chapter 2, we leverage 158 complete genome assemblies to evaluate genome conservation and structural variation. Additionally, we benchmark several common pan-genome analysis pipelines and find that they are prone to inflating the predicted accessory genome size. In chapter 3, we present evidence that gene conversion is a key driver of genetic diversity in a set of hotspots within the MTBC genome. A majority of gene conversion events affect substrates (PE, PPE, and Esx proteins) of the ESX secretion systems, which are implicated in virulence. These findings suggest that an understudied evolutionary force is acting on the MTBC genome.

MeSH terms

  • Genome
  • Biology
  • Mycobacterium tuberculosis complex
  • Computational biology
  • Genetics
  • Whole genome sequencing
  • Genomics
  • DNA sequencing
  • Comparative genomics
  • Infectious disease (medical specialty)
  • Evolutionary biology
  • Bacterial genome size