Decision letter: Previously undetected super-spreading of Mycobacterium tuberculosis revealed by deep sequencing
Conor J. Meehan
Abstract
Article Figures and data Abstract eLife digest Introduction Materials and methods Results Discussion Data availability References Decision letter Author response Article and author information Metrics Abstract Tuberculosis disproportionately affects the Canadian Inuit. To address this, it is imperative we understand transmission dynamics in this population. We investigate whether ‘deep’ sequencing can provide additional resolution compared to standard sequencing, using a well-characterized outbreak from the Arctic (2011–2012, 50 cases). Samples were sequenced to ~500–1000x and reads were aligned to a novel local reference genome generated with PacBio SMRT sequencing. Consensus and heterogeneous variants were identified and compared across genomes. In contrast with previous genomic analyses using ~50x depth, deep sequencing allowed us to identify a novel super-spreader who likely transmitted to up to 17 other cases during the outbreak (35% of the remaining cases that year). It is increasingly evident that within-host diversity should be incorporated into transmission analyses; deep sequencing may facilitate more accurate detection of super-spreaders and transmission clusters. This has implications not only for TB, but all genomic studies of transmission - regardless of pathogen. eLife digest In Canada, tuberculosis disproportionately affects the Inuit, a group of indigenous people inhabiting the Arctic regions. Canada is aiming to eliminate tuberculosis among the Inuit by 2030. One way to help stop transmission and prevent future outbreaks is to trace how and where the disease spreads using DNA sequencing. This information can then be used by public health organizations to identify possible interventions. Typically, the DNA of the bacterium that causes tuberculosis – Mycobacterium tuberculosis, or Mtb for short – is sequenced 50–100 times and a consensus DNA sequence is then generated for each patient from this data. These consensus DNA sequences are then compared to help piece together who infected whom. Recently, scientists have realized that the bacteria a person is infected with may have different DNA sequences due to people being infected with more than one bacterium or the bacterium developing variations in its genome after the infection. However, current DNA sequencing practices may miss these differences, making it harder to trace how the disease spreads. Now, Lee et al. show that sequencing the DNA of Mtb from an infected person 500–1000 times (i.e. ∼10-20 times more than usual) makes it easier to detect genetic differences and determine how tuberculosis spreads. This approach, also known as ‘deep sequencing’, was used to analyze DNA samples of Mtb collected from about 50 people during an outbreak of tuberculosis in 2011-2012, which had previously undergone standard DNA sequencing. This deep sequencing approach identified a ‘super-spreading event’ where one person had likely transmitted tuberculosis to up to 17 others during the outbreak. Lee et al. found that most of these people had visited the same ‘gathering houses’ which are social venues in the community. Implementing targeted public health interventions at these sites may help stop future outbreaks. To fully understand how useful this method will be for tracking the spread of tuberculosis, deep and routine sequencing will need to be compared against each other in different settings and outbreaks. Furthermore, the approach used in this study may be useful for tracking the transmission of other infectious diseases. Introduction Tuberculosis (TB) in Canada is highest among the Inuit, an Indigenous population with a rate over 300 times that of the non-Indigenous Canadian-born population in 2016 (Inuit Tapiriit Kanatami, 2018). Canada recently set a goal of TB elimination in the Inuit by 2030, (Inuit Tapiriit Kanatami, 2018) which will not be achieved without halting ongoing transmission. Previous studies have used genomic data either alone or in conjunction with classical epidemiology to investigate TB transmission dynamics in the Canadian North, (Tyler et al., 2017; Lee et al., 2015a; Lee et al., 2015b) with the aim of identifying clusters to help guide public health interventions. Thus far, such studies have relied on identifying consensus single nucleotide polymorphisms (cSNPs), consistent with prevailing methodology in this field. Recent studies suggest that incorporation of within-host diversity into genomic analyses may provide greater resolution of transmission than cSNP-based approaches alone (Worby et al., 2017; Martin et al., 2018; Meehan et al., 2019; Séraphin et al., 2019). This may be particularly important for investigation of outbreaks occurring over short time scales and/or in settings such as the Canadian North, where the genetic diversity of circulating strains is especially low. In both of these circumstances, it is common to find many samples separated by zero cSNPs, hindering accurate source ascertainment. To investigate this hypothesis, we used deep sequencing (i.e., to ~10-20 fold more than standard, or 500-1000x) to re-evaluate transmission in a densely-sampled outbreak in Nunavik, Québec. This outbreak, which has been previously described, (Lee et al., 2015b; Lee et al., 2016) comprised 50 microbiologically-confirmed cases of TB who were diagnosed in a single Inuit community between 2011–2012 - a rate of 5,359/100,000 for that year. Genomic epidemiology analyses using sequencing depths of ~50x that are standard in such work, identified multiple clusters of transmission in this outbreak, (Lee et al., 2015b) however, there was insufficient genetic variation detected to infer precise person-to-person transmission events within these subgroups, given the short time frame and low mutation rate of M. tuberculosis (~0.2–0.3 SNPs/genome/year for Lineage 4; Menardo et al., 2019). In this study, we illustrate how within-host diversity can be incorporated into transmission analyses. In doing so, we find new features of the transmission networks in this community, in particular, identifying a previously unrecognized super-spreading event. We highlight a potential role for deep sequencing in public health investigations, with implications for TB control in Canada’s North as well as other high-transmission environments. Materials and methods Study subjects Request a detailed protocol All 50 samples from the 2011–2012 outbreak (Lee et al., 2015b) were eligible for inclusion, as well as samples from all cases (n = 15) diagnosed in same village in the preceding five years (2007 onwards), 13/15 of which were caused by the same strain of M. tuberculosis (the ‘Major [Mj]-III’ sublineage; Lee et al., 2015a). There were two episodes of recurrent TB (i.e., where an individual had microbiologically-confirmed TB once, was cured, but developed TB again during the study period); otherwise, all samples are from unique individuals. All cases had pulmonary TB that was Lineage 4 (Euro-American; Lee et al., 2015b). Cross-contamination was ruled out as described in Lee et al. (2015b). DNA extraction and sequencing Request a detailed protocol Samples were cultured once on Middlebrook 7H10 agar and plate sweeps were collected for DNA extraction using the van Soolingen method (van Soolingen et al., 1991). Genomic DNA was quantified using the Quant-iT PicoGreen dsDNA Assay (ThermoFisher Scientific, Massachusetts, USA). Library preparation and sequencing were done at the McGill University/Genome Québec Innovation Centre. The Illumina HiSeq 4000 was used to produce paired-end 100 bp reads. To obtain the depth of coverage needed for this study (~500–1000x for deep sequencing, compared to ~50–100x as routinely done by public health), pooled libraries were run on four independent lanes. Bioinformatics Request a detailed protocol FastQC (v.0.11.5, https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) was used to assess sequencing data quality and reads were trimmed to remove low-quality bases using Trimmomatic (v.0.36 Bolger et al., 2014). Kraken (v.1.1 Wood and Salzberg, 2014) was then used to identify potential contamination with the miniKraken database (minikraken_20171019). Reads classified as ‘Mycobacterium tuberculosis complex’ (MTBC) were extracted using Seqtk (v.1.2, Li H, available at: https://github.com/lh3/seqtk). Reads were then aligned using the Burrows Wheeler Aligner MEM algorithm (v.0.7.15 Li, 2013) to the H37Rv reference (NC_000962.3 in the National Center for Biotechnology Information [NCBI] RefSeq database) and sorted using Samtools (v.1.5 Li et al., 2009). Analyses were later repeated using a local reference genome (described below). Reads were with ambiguous mappings were excluded, as were reads with excessive soft-clipping (i.e., more than 20% of read length) based on our previous work (Martin et al., 2018). Duplicate reads were marked using Picard MarkDuplicates (v.2.9.0, https://broadinstitute.github.io/picard/) and reads were locally re-aligned around indels using Genome Analysis ToolKit (GATK, v.3.8 McKenna et al., 2010). All sites were called using GATK’s Unified Genotyper algorithm, with the -d 1500 to avoid downsampling to 250 (done by default with this tool during variant calling). Variants (cSNPs and heterogenous SNPs [hSNPs]) versus H37Rv were annotated using snpEff (v.4.3t Cingolani et al., 2012). Variants were filtered for quality using custom Python scripts (v.3.6) with the following thresholds: Phred < 50, Root Mean Squared Mapping Quality (RMS-MQ) ≤ 30, depth (DP) < 20, Fisher Strand Bias (FS) ≥ 60 and read position strand bias (ReadPos) < −8 (Martin et al., 2018). cSNPs were classified as positions where ≥ 95% of reads were the alternative allele (ALT), hSNPs were classified as positions where > 5% and < 95% of reads were ALT, and positions with the ALT present in ≤5% of reads were classified as ‘reference’. We also compared inferences of transmission from this analysis to i) when these thresholds were increased to the minimum values among cSNPs in the initial H37Rv analysis, and ii) when cSNPs were classified using a threshold of ≥ 99%, and hSNPs were classified when 1% < ALT < 99%, in order to assess the robustness of inferences to different filtering protocols. Low-quality variants, variants in proline-proline-glutamic acid (PE) and proline-glutamic-acid/polymorphic-guanine-cytosine-rich sequence (PE_PGRS) genes, transposons, phage and integrase, and positions with missing data, were excluded. All samples were drug-susceptible, except for MT-6429, which was rendered resistant to isoniazid by a frameshift deletion at position 1284 in the catalase-peroxidase gene katG. As such, positions associated with drug resistance were not masked in this analysis. Alignments with informative hSNPs were reviewed using Tablet (v.1.17.08.17, Milne et al., 2013). Concatenated core cSNP alignments were made using snp-sites -c (v.2.4.0 Page et al., 2016), with positions with hSNPs excluded. Pairwise cSNP distances between samples were computed using snp-dists (v.0.6, available at https://github.com/tseemann/snp-dists). The frequency of hSNPs at each position in the genome was tabulated and hSNPs were reviewed to identify variants shared between samples. Phylogenetics and clustering Request a detailed protocol Core cSNP alignments were used to generate maximum likelihood trees using IQ-Tree (v.1.6.8 Nguyen et al., 2015). Model selection was based on the lowest Bayesian Information Criterion. Hierarchical Bayesian Analysis of Population Structure (Cheng et al., 2013) was run in R (v.3.5.2) to identify clusters. Phylogenetic trees were visualized using Interactive Tree of Life (Letunic and Bork, 2016). Single molecule Real-Time (SMRT) sequencing and assembly Request a detailed protocol To examine the influence of potential alignment errors in identification of hSNPs, we used SMRT sequencing with the PacBio RSII platform to create a local reference genome for the outbreak. Sample MT-0080 was chosen for sequencing because this was previously identified as the probable source for as many as 19 of the 50 cases diagnosed in 2011–2012 (Lee et al., 2015b). Prior to sequencing, the culture was grown on a Middlebrook 7H10 agar plate. A single colony was then selected and grown further in 3 mL of Middlebrook 7H9 Broth to provide sufficient DNA for SMRT sequencing and Illumina MiSeq (for polishing of the long-read assembly). DNA for SMRT sequencing was extracted using the MagAttract High Molecular Weight DNA Kit from Qiagen (Maryland, USA). High molecular weight fragments were verified using gel electrophoresis. Library preparation and sequencing were then done at the McGill University/Genome Québec Innovation Centre. Prior to sequencing, fragment size was evaluated using a BioAnalyzer and the BluePippin system (Sage Science, Massachusetts, USA) was used for size selection. DNA for Illumina MiSeq was extracted using the van Soolingen method, as previous (van Soolingen et al., 1991). Long-reads were assembled and corrected using Canu (v.1.7.1 Koren et al., 2017). Pilon (v.1.23 Walker et al., 2014) was then used to polish the assembly using the Illumina MiSeq reads from the same colony. This was re-run until no further corrections were possible. Quast (v.5.0.2, Gurevich et al., 2013) was used to evaluate assembly quality. RASTtk (v.2.0 Brettin et al., 2015) was used for annotation, to identify regions for masking as previous. Epidemiological data Request a detailed protocol Epidemiological and clinical data were collected on all cases and contacts using standardized questionnaires as part of the routine public health response, described previously in Lee et al. (2015b); Lee et al. (2016). Statistical analyses Request a detailed protocol A two-sample test of proportions was used to compare overall proportions across references, and the Wilcoxon Signed Rank test was used to compare paired SNP distances. Analyses were done in Stata (v.15, StataCorp, College Station, TX, USA). Results 62/65 (95·4%) available TB samples from cases diagnosed between 2007–2012 were successfully sequenced and passed quality control. This included 48/49 (98·0%) of the samples with an identical Mycobacterial Interspersed Repetitive Units Variable Number Tandem Repeats (MIRU-VNTR) pattern during the outbreak year. The remaining three samples could not be re-grown. Reads that were non-MTBC were removed (Source data 1) and there was no obvious association between percent contamination and hSNP frequency. Epidemiological and clinical data on all outbreak cases are described in Lee et al. (2015b). Average genome coverage and depth across the H37Rv reference was 98·64% [SD 0·07%] and 714·53 [SD 92·68], respectively. Our primary filtering protocol yielded 51,430 cSNPs and 4,897 hSNPs across all individual samples (Source data 2). Excluding positions that were invariant compared to the reference or where any sample was missing and/or was low-quality resulted in a core alignment of 860 cSNP positions and 136 hSNP positions (note, these are not mutually exclusive, as positions with cSNPs in some samples may have hSNPs in others). 42 positions had hSNPs that were shared across all 62 samples (Table 1, Source data 3). Depth of coverage at these positions was, on average, 39% higher than the average depth across the same sample (SD 36·7%, Source data 4). Along with manual review of alignments (Figure 1), this suggested that many of these were false positives, potentially due to alignment error (e.g., from underlying structural variation in our samples compared to the H37Rv reference). Figure 1 Download asset Open asset Pileup of reads showing hSNPs suspected to be due to alignment error as listed in Source data 3, with MT-4942 used as an example and zoomed on position 2,255,171 to 2,280,170 in H37Rv (National Center for Biotechnology Information RefSeq Database Accession NC_000962.3). Binary Alignment Map (BAM) file were loaded into Tablet (v.1.17.08.17, Milne et al., 2013) to visualize the pileup compared to H37Rv. Table 1 Comparison of alignments to H37Rv and MT-0080_PB. Based on these filters: Phred < 50, Root Mean Square Mapping Quality (RMS-MQ) ≤ 30, depth (DP) < 20, Fisher Strand Bias (FS) ≥ 60 and read position strand bias (ReadPos) < −8 and an allelic fraction of ≥ 95% for cSNPs, with hSNPs classified when 5% < ALT < 95%. Quality metrics for the individual cSNPs/hSNPs identified in each sample are given in Source data 2. H37Rv (4,411,532 bp)MT-0080_PB (4,426,525 bp)P valueNumber of positions according to reference genomeInvariant reference across all samples, n (%)4,018,786 (91·10%)4,084,195 (92·27%)<0·00005 aPosition was missing/low quality in at least one sample, n (%)391,761 (8·88%)342,179 (7·73%)<0·00005 aPosition was an c/hSNP in at least one sample, n (%)985 (0·22%)152 (0·00%)<0·00005 aShared cSNPs across all samples, n (%)764 (0·02%)1 (0·00%)<0·00005 aShared hSNPs across all samples, n (%)42 (0·00%)0 (0%)<0·00005 aCore pairwise distancesCore cSNPs vs. reference, median (range)791 (790–792)3 (1–65)<0·00005 bCore cSNPs between samples, median (range)3 (0–64)3 (0–66)<0·00005 b a Two sample test for difference in proportions. b Wilcoxin Signed Rank test. To address this, we generated a local reference genome for the outbreak, MT-0080_PB. Quality metrics for the MT-0080_PB assembly are given in Source data 5. Compared to H37Rv, mean genome coverage and depth were higher with MT-0080_PB (at 99·33% [SD 0·09%] and 717·07 [SD 93·01], respectively), fewer positions were missing/low-quality (p < 0·00005, Table 1), and overall, fewer variable positions were detected (p < 0·00005). While core cSNP distances were similar between samples regardless of the reference (Table 1), the number of hSNPs was greatly reduced using MT-0080_PB (Source data 2); while 4,897 hSNPs were identified across all individual samples using H37Rv, only 125 hSNPs were identified using MT-0080_PB. There were also no hSNPs shared across all 62 samples using MT-0080_PB. Together, these findings support our hypothesis that alignment error is responsible for many of the detected variants, and indicate a local reference is important for accurate identification of All further are based on the MT-0080_PB A maximum likelihood was generated from core cSNP positions sites invariant across all samples and the compared to MT-0080_PB (Figure 2). All 62 samples and were for with our previous (Lee et al., 2015b) with this, (Lee et al., 2015b) identified two and Lee et al., with three Figure with 1 all Download asset Open asset of M. tuberculosis in village likelihood of 62/65 cases diagnosed between 2007–2012 in village based on consensus single nucleotide polymorphisms to a local reference, cSNPs were identified based on a minimum threshold of of reads the alternative A core cSNP alignment was then with IQ-Tree (v.1.6.8 Nguyen et al., 2015) was used to generate the using a with for Model selection was based on the lowest Bayesian Information Criterion. were only values > are were identified using Bayesian Analysis of Population Structure (Cheng et al., 2013). These clusters were consistent with the previously identified in Lee et al. Lee et al. only are and were present in was in village in was in and was in informative for transmission in identified using deep sequencing, are there were two who had a of are used to highlight these samples, with a different for each MT-0080 is included in the alignment as the deep sequencing data from a of all identified a cSNP compared to the MT-0080_PB reference, which was generated from a single colony hSNPs identify a novel super-spreading and more transmission clusters The core cSNPs and hSNPs between samples are in Source data with the identified in the analysis hSNPs with the cSNP-based analysis a novel super-spreader in by genomic epidemiology analyses on sequencing (Lee et al., 2015b). In our previous analysis using routine sequencing depth had suggested that was comprised of two transmission networks we to as for with our the of five cases diagnosed between and while the of cases diagnosed between and for these are given in Figure These two were from one based on the or of a shared cSNP (at position according to in the MT-0080_PB Source all samples in the of five cases shared an alternative allele at this while all samples in the of cases shared the reference the short time low mutation rate of TB, and overall low diversity of strains circulating in the we SNPs to in transmission between these In the analysis, was identified as the probable source for the of five This individual was diagnosed in with and had the same local community ‘gathering houses’ venues identified by public health during the as all four others in this the of was identified as the probable as this was the in this in Figure In contrast to the analysis based on routine sequencing data, our investigation of within-host diversity using deep sequencing data that both the alternative allele reads shared by all of the of five as well as the reference allele reads shared by all of the of (Figure that was the diagnosed in (Figure and all cases in the had or in a at the same as this that is in the most probable source for both hSNP analysis support for suspected transmission Sample and were from and were the only strains from the in this Previous analysis strains from other were (Lee et al., 2015b) while these two samples were separated from one by zero core This transmission between these a hypothesis by hSNP analysis, as the samples hSNPs that are not found in any other sample in the These hSNPs were present when filtering thresholds were used (Source data but were not included when using H37Rv as the reference - potentially due to differences in and for TB had TB in samples were available for two of these samples in and in samples in and in Figure 2). both had new detected at compared to previous cSNP-based analyses suggested episodes of TB were due to with a new than with the strain of within-host diversity this using deep sequencing, we verified that there was a different strain present at both and episodes of There was no for at either or with these more out in this low diversity (Source of cSNP and hSNP thresholds To we were not missing frequency variants using the we our analysis such that hSNPs were classified when 1% < ALT < Quality for individual cSNPs and hSNPs are given in Source data and the core alignment is in Source data While our primary analysis using a threshold of for cSNPs identified a single cSNP to shared across all samples compared to of the MT-0080 deep sequencing data using DNA from a of the that this sample had both at this with only the allele for SMRT sequencing. Based on this, we sequencing samples both using a an alternative sequencing and a single colony when a reference genome for TB, as using the alone may error and this no other informative hSNPs were detected using these Discussion As the TB among the Canadian Inuit, targeted public health interventions are to ongoing transmission. In order to so, it is important that transmission events and associated are Our previous work suggested that hSNP analysis could resolution of TB transmission (Martin et al., 2018). To investigate how this approach could be for TB we used deep sequencing to a TB outbreak in the Canadian work by our group (Martin et al., have that M. tuberculosis within-host diversity can be transmitted between et al., 2019; et al., 2019). deep sequencing data allowed us to identify this diversity in a outbreak compared to previous analyses with standard sequencing depth, (Lee et al., 2015a; Lee et al., 2015b) and detection of a novel super-spreading where one source may have transmitted to of the other cases diagnosed between to This was in to a previously identified super-spreader to 19 cases - up to of the outbreak the may be to these has been described in a number of TB et al., Our findings suggest this can an important role in TB and that accurate detection of super-spreading events is important for public health interventions. In the of as all of the cases had the same local community as the this these venues an important role in transmission in this studies have used to investigate TB et al., 2017; et al., et al., 2015) however, the methods used to assess for at either time have been and may not be sufficient to in settings with low strain In this analysis, we provide that deep sequencing can potentially help out The between and is important at individual and population of in a community indicate a with or potentially to clinical while indicate the need for public health interventions such as in who have had for TB disease in the are also not routinely on based on data is by The to which in is but is the primary this clinical may need to be A genomic epidemiology study is to evaluate To deep sequencing to investigate within-host it is we false We have that using a local strain as a reference can not only but detection of Genomic differences between outbreak strains and H37Rv have been previously by et al. and with and that clinical TB strains may be needed to fully detect in analyses. Our analysis
MeSH terms
- Tuberculosis
- Transmission (telecommunications)
- Outbreak
- DNA sequencing
- Deep sequencing
- Genome
- Mycobacterium tuberculosis
- Whole genome sequencing
- Metagenomics
- Genomics
- Biology
- Indigenous
- Population
- Computational biology
- Geography
- Evolutionary biology
- Genetics