Abstract:
Complete Blood Count (CBC) report features are routinely used to screen for a wide array
of hematological disorders. Because these disorders overlap in complex ways, the underlying
patterns among the features are easily overlooked. Additionally, differences in the expertise
of healthcare professionals and the heterogeneity inherent in the subjective assessment
of a CBC report often lead to random clinical testing. Such disease prediction analyses
can be enhanced by the incorporation of Machine Learning (ML) algorithms for
efficient handling of CBC features. This research presents ML-based models for the
screening of two common blood disorders – anemia and leukemia, using CBC report
features. A ‘fingerprint’ of 14 of the 21 features is selected on the basis of both statistical
and clinical relevance. Hybrid synthetic data are generated based on the statistical
distribution of the features to overcome the constraint of a small dataset. To the best of
the authors' knowledge, this study is the first to employ hybrid synthetic data for
modeling hematological parameters. In this study, six ML models, namely decision tree,
random forest, support vector machine, logistic regression, gradient boosting machine,
and multilayer perceptron, are used. The random forest algorithm shows exceptional
performance, with 98% accuracy and macro-averaged precision, recall, specificity, and
miss rate of 97%, 98%, 99%, and 2%, respectively, for the target
variable. Hence, this algorithm based on CBC features appears to be an efficient support
system for the screening of anemia and leukemia, which has the potential to be deployed
in clinical settings for early intervention in these disorders.
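
As an illustration of how the macro-averaged metrics reported above are conventionally defined, the following minimal sketch computes them from a multiclass confusion matrix. It is not the study's code: the function name `macro_metrics`, the three-class layout (assumed here to be normal, anemia, leukemia), and the toy counts are all hypothetical and chosen only for demonstration. The sketch assumes every class has at least one true sample, in which case the per-class miss rate equals one minus the per-class recall.

```python
from typing import Dict, List


def macro_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Macro-averaged metrics from a square confusion matrix.

    cm[i][j] is the number of samples whose true class is i and
    whose predicted class is j (one-vs-rest counts per class).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precision, recall, specificity, miss_rate = [], [], [], []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # true k, predicted otherwise
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true otherwise
        tn = total - tp - fn - fp
        precision.append(tp / (tp + fp) if tp + fp else 0.0)
        recall.append(tp / (tp + fn) if tp + fn else 0.0)
        specificity.append(tn / (tn + fp) if tn + fp else 0.0)
        miss_rate.append(fn / (tp + fn) if tp + fn else 0.0)
    return {
        "accuracy": sum(cm[k][k] for k in range(n)) / total,
        "precision": sum(precision) / n,
        "recall": sum(recall) / n,
        "specificity": sum(specificity) / n,
        "miss_rate": sum(miss_rate) / n,
    }


# Toy counts (invented for illustration, not the study's data).
# Assumed row/column order: normal, anemia, leukemia.
toy_cm = [
    [50, 0, 0],
    [1, 48, 1],
    [0, 2, 48],
]
metrics = macro_metrics(toy_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these definitions the macro-averaged recall and miss rate sum to one, which matches the reported pairing of 98% recall with a 2% miss rate.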