TB Research

Determination of lung cancer exhaled breath biomarkers using machine learning-a new analysis framework

Setlhare TC, Mpolokang AG, Flahaut E, Chimowa G

Scientific Reports · 2025-07

Abstract

Exhaled breath samples from lung cancer (LC) patients, tuberculosis (TB) patients and asymptomatic controls (C) were analyzed using gas chromatography-mass spectrometry (GC-MS). Ten volatile organic compounds (VOCs) were identified as possible biomarkers after confounders were statistically eliminated to enhance biomarker specificity. The diagnostic potential of these possible biomarkers was evaluated using multiple machine learning models, and their performance in classifying patients and controls was compared. Partial least squares-discriminant analysis (PLS-DA) emerged as the best-performing model for separating lung cancer from controls, with a recall (sensitivity) of 82%, precision of 90%, accuracy of 80% and F1-score of 86%. To further validate this model, TB data were introduced as a confounding disease, and the model achieved precision, recall, accuracy and F1-score of 88% each in distinguishing lung cancer from TB. These findings address inter-disease variability and underscore the reliability of the reported VOCs as potential biomarkers of lung cancer. This study establishes a new framework integrating machine learning and confounder elimination for biomarker confirmation.
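The F1-scores quoted in the abstract can be related to the reported precision and recall through the standard definition of F1 as their harmonic mean. The short sketch below is an illustration only, not the authors' code: the function name `f1_score` and the variable names are assumptions, and the inputs are the rounded percentages given in the abstract, so only agreement to about two decimal places should be expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values as reported in the abstract (hypothetical variable names).
lc_vs_controls = f1_score(precision=0.90, recall=0.82)  # lung cancer vs. controls
lc_vs_tb = f1_score(precision=0.88, recall=0.88)        # lung cancer vs. TB

print(f"LC vs controls F1: {lc_vs_controls:.2f}")
print(f"LC vs TB F1:       {lc_vs_tb:.2f}")
```

Under this definition, the reported precision and recall appear consistent with the reported F1-scores of 86% and 88%. Accuracy cannot be rechecked the same way, because it also depends on the number of true negatives, which the abstract does not give.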

MeSH terms

  • Humans
  • Lung Neoplasms
  • Breath Tests
  • Case-Control Studies
  • Exhalation
  • Adult
  • Aged
  • Middle Aged
  • Female
  • Male
  • Gas Chromatography-Mass Spectrometry
  • Volatile Organic Compounds
  • Machine Learning
  • Biomarkers, Tumor