TB Research

Combining TF-IDF with symptom features to differentiate between lymphoma and tuberculosis case reports

Moanda Diana Pholo, Yskandar Hamam, Abdelbaset Khalaf, Chunling Du

Abstract

In regions where tuberculosis (TB) is a high-burden disease, empirical anti-TB treatment is generally recommended. However, TB can mimic a number of other diseases, such as lymphoma, leading to high rates of misdiagnosis. This paper therefore proposes the use of machine learning and natural language processing techniques to differentiate between tuberculosis and lymphoma. To conduct this study, medical case reports were collected automatically and converted into word vectors, which were then augmented with symptom and biographical features extracted from the case reports. Different machine learning algorithms were applied to the collected data, which comprised 215 TB cases, 505 lymphoma cases and 207 "other" cases. Each algorithm was evaluated on accuracy, precision and recall. With an accuracy of up to 97.3%, and precision and recall scores of up to 96%, logistic regression performed best across datasets and metrics, and its scores were higher on the augmented dataset.
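The feature construction described in the abstract (word vectors augmented with symptom features) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' pipeline: the symptom list, the tokenizer, the smoothed IDF variant and the two example reports are all assumptions made for the example, and biographical features are omitted.

```python
import math
from collections import Counter

# Hypothetical symptom lexicon; the paper's actual symptom list is not given in the abstract.
SYMPTOMS = ["fever", "night sweats", "weight loss", "cough"]


def tokenize(text):
    """Lowercase and split on whitespace after dropping commas and periods."""
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    return cleaned.split()


def tfidf_vectors(docs):
    """Return (vocabulary, one TF-IDF vector per document)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    # A common smoothed IDF variant (assumed here): ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def symptom_flags(text):
    """Binary indicator per symptom: 1.0 if the phrase occurs in the report."""
    lowered = text.lower()
    return [1.0 if s in lowered else 0.0 for s in SYMPTOMS]


def augmented_features(docs):
    """Append the symptom indicators to each document's TF-IDF vector."""
    vocab, vectors = tfidf_vectors(docs)
    names = vocab + ["symptom:" + s for s in SYMPTOMS]
    rows = [vec + symptom_flags(doc) for vec, doc in zip(vectors, docs)]
    return names, rows


# Made-up example reports, for illustration only.
reports = [
    "Patient presented with chronic cough, fever and night sweats.",
    "Biopsy of the enlarged lymph node showed lymphoma. Weight loss noted.",
]
names, rows = augmented_features(reports)
```

Each row in `rows` could then be passed, together with its class label (TB, lymphoma or "other"), to a classifier such as logistic regression. Keeping the symptom indicators as separate columns means they contribute explicit signal even when the corresponding words carry low TF-IDF weight.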

MeSH terms

  • Tuberculosis
  • Logistic regression
  • Recall
  • Artificial intelligence
  • Lymphoma
  • Computer science
  • Machine learning
  • Precision and recall
  • Disease
  • Natural language processing
  • Medicine