Standardized Blood Transcriptomic Datasets for Tuberculosis State Classification (ATB, LTB, CON, OD)

Brendan Alex

Open MIND · 2026-03

Abstract

This repository contains 10 curated blood transcriptomic datasets used to evaluate machine learning models and previously published gene panels for discriminating tuberculosis (TB) disease states. Pediatric and HIV-positive samples were removed to reduce confounding. Both normalised and non-normalised gene expression matrices are included; normalisation was performed using z-scores. The datasets are divided into 6 training sets and 4 test sets for benchmarking model performance. Samples are annotated with four disease states: ATB (Active Tuberculosis), LTB (Latent Tuberculosis), OD (Other Diseases), and CON (Healthy Controls). Not all datasets contain all four disease states. Sample class labels are embedded in the column headers of each expression matrix. Each file contains gene symbols in the first column, with expression values for individual samples in the remaining columns. All files are provided in CSV format. The datasets are derived from publicly available Gene Expression Omnibus (GEO) studies and correspond to the following GSE identifiers: 19491, 37250, 73408, 19439*, 19444*, 28623, 101705, 107994, 42834 and 83456. Credit for the original datasets goes to their respective authors.

*These datasets are subsets of the 19491 superseries.
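The file layout described above (gene symbols in the first column, one column per sample, class labels embedded in the column headers) can be read with a few lines of standard-library Python. The sketch below is illustrative only and rests on assumptions not stated in the abstract: that the label appears in a header as a standalone token such as `ATB_1`, and that z-scores are computed per gene across samples. The function names and the demo data are hypothetical and should be checked against the actual files.

```python
import csv
import io
import re
import statistics

# The four disease states named in the abstract.
LABELS = ("ATB", "LTB", "OD", "CON")


def parse_label(header):
    """Return the disease-state label found in a column header, or None.

    Assumption: the label is a standalone alphabetic token, e.g. "ATB_1".
    The real header format may differ.
    """
    for token in re.split(r"[^A-Za-z]+", header):
        if token.upper() in LABELS:
            return token.upper()
    return None


def load_matrix(handle):
    """Read a genes-by-samples CSV: gene symbols first, samples after."""
    reader = csv.reader(handle)
    header = next(reader)
    samples = header[1:]
    genes, rows = [], []
    for row in reader:
        genes.append(row[0])
        rows.append([float(value) for value in row[1:]])
    return genes, samples, rows


def zscore_rows(rows):
    """Z-score each gene (row) across samples.

    Assumption: per-gene scaling with the population standard deviation;
    the abstract does not say which axis or estimator was used.
    Constant rows are mapped to zeros to avoid division by zero.
    """
    scaled = []
    for values in rows:
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)
        scaled.append([(v - mean) / sd if sd else 0.0 for v in values])
    return scaled


# Made-up demo data, not taken from the repository.
demo = io.StringIO(
    "Gene,ATB_1,ATB_2,LTB_1,CON_1\n"
    "GENE_A,1.0,2.0,3.0,4.0\n"
    "GENE_B,5.0,5.0,5.0,5.0\n"
)
genes, samples, rows = load_matrix(demo)
labels = [parse_label(name) for name in samples]
z = zscore_rows(rows)
```

For a real file, the `io.StringIO` demo would be replaced with `open(path, newline="")`. Because not every dataset contains all four disease states, it is worth checking the set of parsed labels, and any headers that return `None`, before training or evaluating a model.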

MeSH terms

Benchmarking
Computer science
Transcriptome
Artificial intelligence
Data mining
Computational biology
Class (philosophy)
Gene expression
Gene
Expression (computer science)
Sample (material)
