iValiD-TB: a fully characterized Mycobacterium tuberculosis dataset for antimicrobial resistance bioinformatics workflow validations
Pascal Lapierre, Joseph Shea, Shannon G. Murphy, Carol Smith, Donna Kohlerschmidt, Michelle C. Dickinson, Kimberlee A. Musser, Vincent Escuyer
Frontiers in Tuberculosis · 2024-09
Abstract
Tuberculosis has been a bane of humanity for centuries and, to this day, is estimated to affect more than two billion people worldwide (CDCGlobal, 2024). The slow growth rate of Mycobacterium tuberculosis (MTB) is a major challenge for timely diagnosis and appropriate treatment of cases (Rageade et al., 2014; Asmar and Drancourt, 2015). Diagnostic delays can lead to increased disease burden, higher costs and a greater risk of treatment failure (Santos et al., 2021). A crucial aspect of clinical assay validation is the availability of well-characterized samples to assess the specificity and sensitivity of the methodology being tested (NYSDOH NGS Molecular Guidance update, 2023). Due to logistical constraints, the nature and availability of specimens, and the geographic diversity of strains, many laboratories struggle to access adequate clinical MTB specimens for their validation needs (WHO operational handbook on tuberculosis. Module 3: diagnosis - rapid diagnostics for tuberculosis detection, 2021). Consequently, validation studies may not be representative of the myriad clinical samples and drug resistance profiles that a clinical laboratory may receive. Diverse and well-characterized datasets are therefore needed for the standardized validation of MTB next generation sequencing (NGS) assays. Reference datasets of clinical TB samples and synthetic genomes have been released in the past for research, development and proficiency testing purposes, but with limited phenotypic drug susceptibility testing (DST) information (Borrell et al., 2019; Anthony et al., 2023). Here, we have assembled a comprehensive dataset of well-characterized whole genome sequences (WGS) from Mycobacterium tuberculosis strains to aid in the development of clinical assays for this pathogen.
This dataset includes whole genome paired-end read sets obtained through Illumina MiSeq sequencing, along with detailed profiles of drug susceptibility patterns and mutations known to be associated with antimicrobial resistance (AR) to nine MTB drugs. The dataset has been curated to include a broad range of lineage diversity, drug susceptibility profiles, and mutation types. As such, it contains only two pairs of phylogenetically closely related strains (iValiD-TB-S22 and iValiD-TB-S23, with 0 SNP differences, and iValiD-TB-S6 and iValiD-TB-S46, with 6 SNP differences), based on our pipeline estimation. A complete SNP matrix is included in Supplemental Table 3. The sequence read dataset has been made available for bioinformatics pipeline development and for clinical assay validation of the bioinformatic analysis pipeline, serving as a valuable resource to advance research and enhance the development of clinical MTB NGS assays.
A total of 50 members of the Mycobacterium tuberculosis complex (MTBC) were sequenced: 47 strains of Mycobacterium tuberculosis, one Mycobacterium caprae strain, one Mycobacterium bovis strain and one Mycobacterium bovis-BCG strain (Figure 1). These strains were part of our collection of samples obtained from New York State patients since the implementation in 2013 of our clinical diagnostic and reporting TB NGS assay (Shea et al., 2017).
Of the MTBC, six strains are from Lineage 1, ten from Lineage 2, seven from Lineage 3, twenty-one from Lineage 4, and one representative each of Lineages 5, 6 and 9. Of the 50 samples, 14 were determined to be pan-susceptible to nine drugs (rifampin, isoniazid, ethambutol, pyrazinamide, streptomycin, ethionamide, fluoroquinolones, kanamycin, amikacin) by phenotypic drug susceptibility testing, 18 were mono-resistant, 17 were multi-drug resistant (MDR) and one was extensively drug resistant (XDR) (Figure 1, Supplemental Table 1).
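Users who want to reproduce the close-relative check against the SNP matrix in Supplemental Table 3 could scan the pairwise distances for values at or below a chosen cutoff. The following is a minimal sketch only: the function name `close_pairs`, the in-memory matrix layout and the cutoff of 10 SNPs are assumptions made for illustration, and only the two distances quoted in the text (0 and 6) are taken from the dataset description; the remaining values are placeholders.

```python
from itertools import combinations


def close_pairs(samples, matrix, max_snps):
    """Return (sample_a, sample_b, distance) for every pair at or below max_snps.

    `matrix` is a square, symmetric list of lists of pairwise SNP distances,
    ordered like `samples`. This layout is a hypothetical one chosen for the
    sketch, not the format of Supplemental Table 3.
    """
    pairs = []
    for i, j in combinations(range(len(samples)), 2):
        if matrix[i][j] != matrix[j][i]:
            raise ValueError(f"matrix is not symmetric at ({i}, {j})")
        if matrix[i][j] <= max_snps:
            pairs.append((samples[i], samples[j], matrix[i][j]))
    return pairs


samples = ["iValiD-TB-S6", "iValiD-TB-S22", "iValiD-TB-S23", "iValiD-TB-S46"]

# Only the S22/S23 (0) and S6/S46 (6) distances come from the text.
# The value 500 is an illustrative placeholder, not a measured distance.
matrix = [
    [0, 500, 500, 6],
    [500, 0, 0, 500],
    [500, 0, 0, 500],
    [6, 500, 500, 0],
]

# The cutoff of 10 SNPs is an assumption for this example.
for sample_a, sample_b, distance in close_pairs(samples, matrix, max_snps=10):
    print(f"{sample_a}\t{sample_b}\t{distance}")
```

A laboratory would choose the cutoff to match its own relatedness criteria; the dataset description does not prescribe one.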
A total of 1,073 different mutations present in the 53 screened loci (Supplemental Table 2) were characterized by WGS in this dataset, of which 107 mutations were identified as associated with drug resistance, most of which are part of the World Health Organization 2023 Catalogue of mutations in Mycobacterium tuberculosis (WHO, 2023). The New York State Department of Health implemented this assay for clinical diagnostics before the WHO catalogue was released and, as such, we use our own susceptibility interpretation criteria. Consequently, users of this dataset will have to apply their own decision criteria based on their individual workflow characteristics and interpretations. The characterized mutations include single nucleotide polymorphisms (SNPs), stop codons, promoter mutations, small insertions and deletions (indels) and large genomic deletions. The locations of the mutations, their types and effects on drug resistance, as well as DST results, lineage information, spoligotype and expected mapping statistics, are all listed in an individual report card for each sample (Figure 2). The release of this fully characterized dataset will facilitate the development and benchmarking of bioinformatics tools for MTB NGS diagnostics and aid in the validation of these clinical assays. The read sequences are accessible from the NCBI SRA under BioProject PRJNA980174. The associated AR report cards for the 50 samples are available in Dryad at https://doi.org/10.5061/dryad.4j0zpc8m8.
Genomic DNA extraction, sequencing library preparation and bioinformatics pipeline methods were described previously (Shea et al., 2017). DST results were determined by either the agar proportion method on solid 7H10 agar or the Becton Dickinson 960 system MGIT SIRE-P assay, according to the Clinical and Laboratory Standards Institute's recommendations (Woods et al., 2011).
The following concentrations were used for DST determinations: streptomycin 1.0 μg/ml; isoniazid 0.1, 0.2, 0.4 and 1.0 μg/ml; rifampin 1.0 μg/ml; ethambutol 5.0 and 10.0 μg/ml; pyrazinamide 100 μg/ml; kanamycin 5.0 μg/ml; and ofloxacin 1.0, 2.0 and 4.0 μg/ml. Ofloxacin is used in our laboratory as a representative of the fluoroquinolone (FQ) drug class. Genotypic identification, mapping statistics and in silico spoligotyping were done as described previously (Shea et al., 2017).
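As an illustration of how the phenotypic DST results could serve as the reference standard when validating a pipeline, the sensitivity and specificity of genotypic resistance calls for a single drug can be tallied from a confusion matrix. This is a sketch under stated assumptions, not the authors' procedure: the function name `sensitivity_specificity`, the boolean input format and the example calls are all hypothetical and do not come from the dataset.

```python
def sensitivity_specificity(calls):
    """Tally pipeline calls against phenotypic DST results for one drug.

    `calls` is an iterable of (predicted_resistant, phenotype_resistant)
    booleans, one tuple per sample. This input format is an assumption
    made for the sketch.
    """
    tp = fp = tn = fn = 0
    for predicted, phenotype in calls:
        if predicted and phenotype:
            tp += 1  # resistance predicted and confirmed by DST
        elif predicted and not phenotype:
            fp += 1  # resistance predicted, DST susceptible
        elif not predicted and phenotype:
            fn += 1  # resistance missed by the pipeline
        else:
            tn += 1  # susceptible by both methods
    # Report None rather than dividing by zero when a class is absent.
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# Hypothetical calls for illustration only; these are not dataset results.
example_calls = (
    [(True, True)] * 4      # true positives
    + [(False, True)] * 1   # false negative
    + [(False, False)] * 4  # true negatives
    + [(True, False)] * 1   # false positive
)

print(sensitivity_specificity(example_calls))
```

Because interpretation criteria differ between workflows, each laboratory would supply its own resistance calls and decide how to treat discordant or indeterminate results before tallying.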
MeSH terms
- Workflow
- Mycobacterium tuberculosis
- Tuberculosis
- Antimicrobial
- Antibiotic resistance
- Computational biology
- Bioinformatics
- Mycobacterium
- Microbiology
- Biology
- Medicine
- Computer science