TB Research

Standardized Blood Transcriptomic Datasets for Tuberculosis State Classification (ATB, LTB, CON, OD)

Brendan Alex

Open MIND · 2026-03

Abstract

This repository contains 10 curated blood transcriptomic datasets used to evaluate machine learning models and previously published tuberculosis (TB) disease-state discriminatory gene panels. Pediatric and HIV-positive samples were removed to reduce confounding. Both z-score-normalised and non-normalised gene expression matrices are included. The datasets are divided into 6 training sets and 4 test sets for benchmarking model performance. Samples are annotated with one of four disease states: ATB (Active Tuberculosis), LTB (Latent Tuberculosis), OD (Other Diseases), and CON (Healthy Controls). Not all datasets contain all four disease states. Sample class labels are embedded in the column headers of each expression matrix. Each file contains gene symbols in the first column and expression values for individual samples in the remaining columns. All files are provided in CSV format. The datasets are derived from publicly available Gene Expression Omnibus (GEO) studies and are identified by their GSE accession numbers: GSE19491, GSE37250, GSE73408, GSE19439*, GSE19444*, GSE28623, GSE101705, GSE107994, GSE42834, and GSE83456. Credit for the original datasets goes to their respective authors. *GSE19439 and GSE19444 are subsets of the GSE19491 superseries.
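The file layout described above (gene symbols in the first column, one column per sample, class label embedded in each column header) can be read with a few lines of Python. The following is a minimal sketch, not the procedure used to build the datasets. It rests on several assumptions that the record does not state: that the label appears in the header as a separate token such as `S1_ATB`, and that z-scoring is applied per gene across samples. The sample names, gene names, and values in the demo are made up for illustration.

```python
import csv
import io
import re
import statistics
from collections import Counter

# The four disease states named in the abstract.
STATES = ("ATB", "LTB", "CON", "OD")


def label_from_header(header):
    """Return the disease state found in a column header, or None.

    Assumption: the label is a separate token in the header
    (e.g. "S1_ATB"); the real header format may differ.
    """
    tokens = re.split(r"[^A-Za-z0-9]+", header.upper())
    for state in STATES:
        if state in tokens:
            return state
    return None


def read_matrix(handle):
    """Read a gene-by-sample CSV into (sample_headers, {gene: [values]})."""
    reader = csv.reader(handle)
    header = next(reader)
    samples = header[1:]  # first column holds gene symbols
    matrix = {row[0]: [float(v) for v in row[1:]] for row in reader if row}
    return samples, matrix


def zscore(values):
    """Z-score a list of values using the population standard deviation.

    Assumption: scaling is per gene across samples; the record only
    says "z-score normalisation" without naming the axis.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [0.0 if sd == 0 else (v - mean) / sd for v in values]


# Tiny made-up example in the assumed layout (not real data).
demo = io.StringIO(
    "Gene,S1_ATB,S2_LTB,S3_CON,S4_OD\n"
    "GENE_A,1.0,2.0,3.0,4.0\n"
    "GENE_B,5.0,5.0,5.0,5.0\n"
)
samples, matrix = read_matrix(demo)
labels = [label_from_header(s) for s in samples]
class_counts = Counter(labels)  # samples per disease state
scaled = {gene: zscore(vals) for gene, vals in matrix.items()}
```

To use this on a real file, replace the `io.StringIO` demo with `open(path, newline="")`. Because not every dataset contains all four states, checking `class_counts` before training or evaluating a classifier is a sensible first step.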

MeSH terms

  • Benchmarking
  • Computer science
  • Transcriptome
  • Artificial intelligence
  • Data mining
  • Computational biology
  • Class (philosophy)
  • Gene expression
  • Gene
  • Expression (computer science)
  • Sample (material)