Abstract:
In the current digital era, data sources generate large volumes of data in every field of
life, and these volumes grow every second as the number of data sources grows.
Machines (e.g., sensors, devices) and humans (e.g., filled forms, saved data) are the
main sources of this data. Machines record every relevant event in their recording
medium by combining the recorded inputs. Occasionally, a machine receives only some
of the inputs and saves them, leaving the missing input
values as null. Sometimes an identical event is logged twice, as duplicate data, by one or
more machines in the same recording medium. Duplicate records and missing values are
also found in human-generated data, mostly due to human error. Missing values and duplicates
degrade the quality of data and bias data mining results. An efficient Data Cleansing System
(DCS) is developed in this research to improve data quality by filling missing
values and removing duplicate records from a dataset. Similarity-based (SimFiller) and
duplicate-detection-based (DuDeFiller) missing-value filling algorithms are developed and
integrated into the system; they fill the missing values of a record with the corresponding
values from its most similar or duplicate record. A duplicate-detection-based data fusion
algorithm (DuDeFuse) is also developed and integrated into the system to merge duplicate
records based on either the maximum similarity or the maximum occurrence of each
attribute’s value. Two datasets from the UCI Machine Learning Repository are cleaned
through the system to improve their data quality. The selected datasets are also cleaned with
five other existing missing-value filling algorithms for the purpose of comparison. Five
classifiers, “Naive Bayes”, “Decision Tree”, “Random Forest”, “Deep Learning”, and
“Logistic Regression”, are used to measure the classification accuracy and the f-measure of
the cleaned datasets. The average classification accuracy on both datasets increases by up to
3.00% after cleansing the datasets with the developed system. The f-measure also
increases by up to 3.26% after cleansing with the developed system, compared with data
cleansing performed with the other algorithms. The developed system can be extended to
resolve other inconsistencies in datasets and can also be evaluated on other datasets.
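
The two ideas summarized above, filling a missing value from the most similar record and
fusing duplicates by the most frequent attribute value, can be illustrated with a minimal
sketch. This is not the SimFiller, DuDeFiller, or DuDeFuse implementation: the function
names, the use of None for null values, and the similarity measure (the fraction of matching
attributes among those known in both records) are all assumptions made for illustration.

```python
from collections import Counter

MISSING = None  # assumption: null values are represented as None


def similarity(a, b):
    """Fraction of equal values among attributes known in both records (illustrative measure)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return 0.0
    return sum(x == y for x, y in shared) / len(shared)


def fill_from_most_similar(records):
    """Fill each missing value with the value held by the most similar record that has it."""
    filled = [list(r) for r in records]
    for i, rec in enumerate(records):
        for j, value in enumerate(rec):
            if value is not MISSING:
                continue
            # Candidate donors: every other record with a known value for attribute j.
            donors = [r for k, r in enumerate(records) if k != i and r[j] is not MISSING]
            if donors:
                best = max(donors, key=lambda r: similarity(rec, r))
                filled[i][j] = best[j]
    return filled


def fuse_by_occurrence(duplicates):
    """Merge duplicate records by keeping the most frequent known value of each attribute."""
    fused = []
    for column in zip(*duplicates):
        known = [v for v in column if v is not MISSING]
        fused.append(Counter(known).most_common(1)[0][0] if known else MISSING)
    return fused


records = [
    ["red", "small", "A"],
    ["red", "small", None],   # missing class label
    ["blue", "large", "B"],
]
print(fill_from_most_similar(records)[1])   # ['red', 'small', 'A']
print(fuse_by_occurrence([
    ["red", "small", "A"],
    ["red", "small", None],
    ["red", "large", "A"],
]))                                          # ['red', 'small', 'A']
```

In this sketch the second record takes its label from the first record because the two agree
on every attribute they both have. A real system would need a similarity measure suited to
numeric and textual attributes and a threshold for deciding when two records are duplicates.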