NUST Institutional Repository

Modelling of Variables of Leukemia by Analyzing Complete Blood Count Reports of Normal and Disease Cases: A Case Study of Pakistan

dc.contributor.author Iqbal, Azka
dc.date.accessioned 2021-09-17T04:28:45Z
dc.date.available 2021-09-17T04:28:45Z
dc.date.issued 2021-04-25
dc.identifier.other RCMS003267
dc.identifier.uri http://10.250.8.41:8080/xmlui/handle/123456789/26105
dc.description.abstract Leukemia is a potentially fatal disease that originates in the bone marrow and causes abnormal proliferation of White Blood Cells (WBC), Red Blood Cells (RBC) and Platelets. A basic and routine screening test that may indicate leukemia is the Complete Blood Count (CBC) report, which measures the parameters and features of almost all the different types of cells present in the blood. The current procedure for investigating leukemia from a CBC report is usually subjective and therefore varies from practitioner to practitioner, carrying a high risk of misdiagnosis or missed diagnosis. There is therefore a need to develop objective, data-driven models for the prediction of leukemia. This study develops predictive models for leukemia screening using logistic regression on the significant variables of a CBC report. Primary data comprising 302 CBC reports were collected from eight hospitals in Rawalpindi and Islamabad (the twin cities of Pakistan). Of these reports, 235 are disease (leukemic) cases and 67 are normal (non-leukemic) cases. The analysis consists of three sections. Section I deals with pre-processing of the variables. A CBC report usually consists of 21 variables, namely Age, Gender, White Blood Cell count (WBC), Red Blood Cell count (RBC), Haemoglobin (Hb), Haematocrit (HCT/PCV), Mean Corpuscular Volume (MCV), Mean Corpuscular Haemoglobin (MCH), Mean Corpuscular Haemoglobin Concentration (MCHC), Platelet Count (PLT), Neutrophil Count (Neut), Lymphocyte Count (LYM), Basophil Count (BASO), Eosinophil Count (Eo), Monocyte Count (Mo), Neutrophil Percentage, Lymphocyte Percentage, Basophil Percentage, Eosinophil Percentage, Monocyte Percentage and Reticulocytes Percentage (RT). In the pre-processing step, variables with a high percentage of missing values were dropped, such as “Reticulocytes Percentage”, which has 67.33% missing values. For every variable, all entries of zero (0) are treated as missing values.
If any entry is missing for a case, the complete case is deleted. As a result, 15 cases were deleted and 287 cases were used for further analysis. A CBC report duplicates the information of a few variables by reporting both counts and percentages, for instance for “Neutrophil”. To avoid this duplication, the percentage variables were dropped and a final set of 15 variables was selected for further analysis. These short-listed variables are Age, Gender, WBC, RBC, Haemoglobin, Haematocrit, MCV, MCH, MCHC, Platelet Count, Neutrophil Count, Lymphocyte Count, Basophil Count, Eosinophil Count and Monocyte Count. Section II provides the results of independent-samples t-tests comparing the means of normal and disease cases for the 14 quantitative variables. The results show that 11 variables differ significantly between normal and disease cases, while 3 variables (Age, MCV and MCH) show no significant difference. A bivariate correlation analysis was performed to check for multicollinearity among the variables. The results show strong, significant correlations between variables; including all of them in a binary logistic regression model is therefore inappropriate and could introduce multicollinearity. Section III deals with the development of the binary logistic regression model. Seven methods of model development were used: the Enter method, and Forward Stepwise Selection and Backward Stepwise Elimination (each using the Conditional, Likelihood Ratio and Wald criteria). Feature (variable) selection was based on the Wald criterion (p-value) and the odds ratios.
The results of the different model specifications show that 5 variables (Gender, Haemoglobin, MCHC, Neutrophil Count and Monocyte Count) are statistically and biologically significant for screening leukemic patients using a CBC report. For the binary logistic model based on these 5 variables, the accuracy, sensitivity, specificity and precision are about 92%, 94%, 86% and 95%, respectively. The results of the study can support physicians' decision making when screening for leukemia using the variables of a CBC report. Combining objective and subjective judgment can improve accuracy and precision in the early diagnosis or screening of leukemia using a common, inexpensive test. en_US
dc.description.sponsorship Dr. Zamir Hussain en_US
dc.language.iso en_US en_US
dc.publisher RCMS NUST en_US
dc.subject Leukemia, Modelling, Blood, Disease en_US
dc.title Modelling of Variables of Leukemia by Analyzing Complete Blood Count Reports of Normal and Disease Cases: A Case Study of Pakistan en_US
dc.type Thesis en_US
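
The pre-processing rules stated in the abstract (zero entries treated as missing, variables with a high share of missing values dropped, incomplete cases deleted) can be sketched roughly as follows. This is a minimal illustration and not the thesis's own code: the toy records, the function names and the 50% cut-off are assumptions, since the abstract only says "high percentage" without giving a threshold.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def zeros_to_missing(rows: List[Row]) -> List[Row]:
    """Treat every entry equal to zero as a missing value (None)."""
    return [{k: (None if v == 0 else v) for k, v in row.items()} for row in rows]


def missing_fraction(rows: List[Row], var: str) -> float:
    """Share of cases in which `var` is missing."""
    return sum(row[var] is None for row in rows) / len(rows)


def preprocess(rows: List[Row], max_missing: float = 0.5) -> List[Row]:
    """Drop sparse variables, then delete every case that is still incomplete.

    `max_missing` is a hypothetical cut-off chosen for this sketch.
    """
    rows = zeros_to_missing(rows)
    kept = [v for v in rows[0] if missing_fraction(rows, v) <= max_missing]
    reduced = [{v: row[v] for v in kept} for row in rows]
    return [row for row in reduced if all(val is not None for val in row.values())]


# Made-up records for illustration only; these are not data from the study.
toy = [
    {"WBC": 7.2, "Hb": 13.5, "RT": None},
    {"WBC": 54.0, "Hb": 0, "RT": None},
    {"WBC": 6.1, "Hb": 14.1, "RT": 1.2},
    {"WBC": 88.3, "Hb": 8.4, "RT": 0},
]
cleaned = preprocess(toy)
print(len(cleaned), sorted(cleaned[0]))
```

In this toy input, "RT" is missing in three of four cases and is dropped as a variable, and the case with a zero "Hb" entry is then deleted as incomplete. Note that the zero-as-missing rule would also discard a category coded as 0, so a variable such as Gender would need a coding that avoids zero.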


This item appears in the following Collection(s)

  • MS [159]
