TB Research

An imputed ancestral reference genome for the <i>Mycobacterium tuberculosis</i> complex better captures structural genomic diversity for reference-based alignment workflows

Luke B. Harrison, Vivek Kapur, Marcel A. Behr

bioRxiv (Cold Spring Harbor Laboratory) · 2023-09

Abstract

Reference-based alignment of short reads is a widely used technique in genomic analysis of the Mycobacterium tuberculosis complex (MTBC), and the choice of reference sequence affects the interpretation of analyses. The most widely used reference genomes include the ATCC type strain (H37Rv) and the putative MTBC ancestral sequence of Comas et al., both of which are based on a lineage 4 sequence. As such, these reference sequences do not capture the complete structural variation now known to be present in the MTBC. To better represent the base of the MTBC, we generated an imputed ancestral genomic sequence, termed MTBC0, from reference-free alignments of closed MTBC genomes. When used as a reference sequence in alignment workflows, MTBC0 mapped more short sequencing reads and called more SNPs relative to the Comas et al. sequence while exhibiting minimal impact on the overall phylogeny of the MTBC. The results also show that MTBC0 provides greater fidelity in capturing genomic variation and allows for the inclusion of regions absent in H37Rv, such as the TbD1 and RvD4496/RD7/RD713 regions, in standard MTBC workflows without additional steps. Incorporating MTBC0 as an ancestral reference sequence into standard workflows modestly improved read mapping and SNP calling, and intuitively facilitates the study of structural variation and evolution in the MTBC.

Data Summary

The MTBC0 sequence is available in the online data supplement in FASTA format at https://github.com/lukebharrison/MTBC0. Included with the MTBC0 sequence in the data supplement are: the reference-free alignment of MTBC closed genomes in hierarchical alignment (HAL) format, control files for cactus, annotations for H37Rv and L8, a BED file of regions excluded from SNP calls lifted over onto MTBC0, as well as the scripts used to call SNPs and the phylogenetic trees generated in this article. All previously published sequence data are available at the NCBI nucleotide and SRA databases; accession numbers for sequences used in this manuscript are available in Supplementary Tables 1 and 2.

Impact Statement

This article describes an imputed ancestral genomic sequence (MTBC0) at the base of the MTBC for use as a reference sequence in Mycobacterium tuberculosis genomic workflows. Widely used reference sequences are limited to the structural diversity present in H37Rv, a lineage 4 isolate. MTBC0 overcomes this limitation by incorporating the structural variation present at the base of the MTBC, drawing on a wide sample of human and animal lineages, including newly discovered lineages (L8, M. orygis). Use of MTBC0 enables the mapping of more reads and the calling of more SNPs, and allows for the investigation of structural variation not present in the currently used reference sequences within this important group of animal and human pathogens.
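
As an illustration of how a BED file of excluded regions, such as the one listed in the data supplement, might be applied to SNP calls, the following is a minimal Python sketch. It is not the authors' script (those are in the repository). The contig name `MTBC0`, the function names, and the example coordinates are all hypothetical. The sketch assumes the usual conventions: BED intervals are 0-based and half-open, while SNP positions (as in VCF) are 1-based.

```python
from bisect import bisect_right


def parse_bed(lines):
    """Parse BED lines into {contig: merged, sorted (start, end) intervals}.

    Intervals are kept in BED coordinates (0-based, half-open).
    """
    raw = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and common BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        contig, start, end = line.split("\t")[:3]
        raw.setdefault(contig, []).append((int(start), int(end)))

    merged = {}
    for contig, intervals in raw.items():
        out = []
        for start, end in sorted(intervals):
            if out and start <= out[-1][1]:
                # Overlapping or adjacent: extend the previous interval.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[contig] = out
    return merged


def is_excluded(regions, contig, pos_1based):
    """Return True if a 1-based position falls inside an excluded interval."""
    intervals = regions.get(contig, [])
    pos0 = pos_1based - 1  # convert to the 0-based BED coordinate
    # Index of the last interval whose start is <= pos0.
    i = bisect_right(intervals, (pos0, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos0 < intervals[i][1]


def filter_snps(snps, regions):
    """Keep only (contig, 1-based position) SNPs outside excluded regions."""
    return [(c, p) for c, p in snps if not is_excluded(regions, c, p)]


# Hypothetical example data; real coordinates come from the supplement.
example_bed = [
    "MTBC0\t10\t20",
    "MTBC0\t15\t30",
    "MTBC0\t100\t110",
]
example_regions = parse_bed(example_bed)
example_snps = [("MTBC0", 10), ("MTBC0", 11), ("MTBC0", 31), ("MTBC0", 101)]
kept_snps = filter_snps(example_snps, example_regions)
```

In practice this step is usually done with established tools rather than hand-written code. The sketch is only meant to make the coordinate conventions explicit.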

MeSH terms

  • Mycobacterium tuberculosis complex
  • Reference genome
  • Biology
  • Genome
  • Sequence (biology)
  • Whole genome sequencing
  • Sequence analysis
  • Genetics
  • Computational biology