Benchmarking the empirical accuracy of short-read sequencing across the M. tuberculosis genome
Marin M, Vargas R, Harris M, Jeffrey B, Epperson LE, Durbin D, Strong M, Salfinger M, et al. (14 authors)
Bioinformatics (Oxford, England) · 2022-03
Abstract
Motivation: Short-read whole-genome sequencing (WGS) is a vital tool for clinical applications and basic research. Genetic divergence from the reference genome, repetitive sequences and sequencing bias reduce the performance of variant calling using short-read alignment, but the loss in recall and specificity has not been adequately characterized. To benchmark short-read variant calling, we used 36 diverse clinical Mycobacterium tuberculosis (Mtb) isolates dually sequenced with Illumina short reads and PacBio long reads. We systematically studied short-read variant calling accuracy and the influence of sequence uniqueness, reference bias and GC content.
Results: Reference-based Illumina variant calling demonstrated a maximum recall of 89.0% and minimum precision of 98.5% across the parameters evaluated. The approach that maximized variant recall while still maintaining high precision (
Availability and implementation: All relevant code is available at https://github.com/farhat-lab/mtb-illumina-wgs-evaluation.
Supplementary information: Supplementary data are available at Bioinformatics online.
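The recall and precision figures quoted in the abstract can be read as set comparisons between short-read variant calls and a trusted set of variants for the same isolate. The following is a minimal sketch of that arithmetic, not the authors' pipeline: the function name `score_calls`, the `(position, ref, alt)` variant representation and the toy data are all assumptions made for illustration, and it assumes the trusted set would come from the long-read data.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical variant representation: (position, reference allele, alternate allele).
Variant = Tuple[int, str, str]


class Accuracy(NamedTuple):
    recall: float
    precision: float


def score_calls(truth: Set[Variant], calls: Set[Variant]) -> Accuracy:
    """Compare a set of called variants against a truth set."""
    true_pos = len(truth & calls)    # called and present in the truth set
    false_neg = len(truth - calls)   # in the truth set but missed by the caller
    false_pos = len(calls - truth)   # called but absent from the truth set

    # Recall: fraction of true variants that were recovered.
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    # Precision: fraction of calls that are true variants.
    precision = true_pos / (true_pos + false_pos) if calls else 0.0
    return Accuracy(recall, precision)


# Toy data only; these are not values from the study.
truth = {(100, "A", "G"), (250, "C", "T"), (400, "G", "A"), (900, "T", "C")}
calls = {(100, "A", "G"), (250, "C", "T"), (400, "G", "A"), (1200, "A", "T")}

result = score_calls(truth, calls)
print(f"recall={result.recall:.2f} precision={result.precision:.2f}")
```

In this toy case one true variant is missed and one call is spurious, so both metrics come out at 0.75. Exact tuple matching is a simplification; real comparisons usually need variant normalization before calls can be matched.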
MeSH terms
- Humans
- Mycobacterium tuberculosis
- Tuberculosis
- Sequence Analysis, DNA
- Software
- Benchmarking
- High-Throughput Nucleotide Sequencing