Application Of Data Duplicate Detection And Fusion To Improve Data Quality

Fateh-Ur-Rehman

URI: http://10.250.8.41:8080/xmlui/handle/123456789/20230

Date: 2018

Abstract:

In the current digital era, data sources are producing abundant volumes of data related to every field of life. Data volumes expand every second because the sources of data creation are also expanding. Machines (e.g. sensors, devices) and humans (e.g. filled forms, saved data) are the main sources of data in this digital era. Machines record each relevant event in their recording medium by combining the recorded inputs. Occasionally, a machine receives only some of the inputs and saves the record with the missing input values set to null. Sometimes an identical event is logged twice as duplicate data, by one or several machines, in the same recording medium. Duplicate records and missing values are also found in human-generated data, mostly due to human error. Missing values and duplicates degrade the quality of data and bias data mining results. An efficient Data Cleansing System (DCS) was developed during the research to improve data quality by filling the missing values and removing the duplicate records from a dataset. Similarity-based (SimFiller) and duplicate-detection-based (DuDeFiller) missing-value filling algorithms were developed and integrated into the system; they fill the missing values of a record by taking the replacement value from its most similar or duplicate record. A duplicate-detection-based data fusion algorithm (DuDeFuse) was also developed and integrated into the system to merge duplicate records based on the maximum similarity of an attribute’s value or the maximum occurrence of an attribute’s value. Two datasets from the UCI Machine Learning Repository were cleaned with the system to improve their data quality. For comparison, the selected datasets were also cleaned with five other existing missing-value filling algorithms.
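The two ideas described above, filling a missing value from the most similar record and fusing duplicates by the most frequent attribute value, can be illustrated with a minimal Python sketch. This is not the thesis's SimFiller, DuDeFiller, or DuDeFuse implementation: the function names, the simple exact-match similarity measure, and the sample records are all assumptions chosen only for illustration.

```python
from collections import Counter


def similarity(a, b):
    """Fraction of attributes known in both records whose values agree.

    Assumption: a plain exact-match overlap, not the thesis's measure.
    """
    shared = [k for k in a if a[k] is not None and b.get(k) is not None]
    if not shared:
        return 0.0
    return sum(a[k] == b[k] for k in shared) / len(shared)


def fill_missing(records):
    """Fill each missing (None) value from the most similar record that has it."""
    filled = [dict(r) for r in records]  # copy so the input stays untouched
    for i, rec in enumerate(records):
        for key, value in rec.items():
            if value is not None:
                continue
            # Candidate donors: every other record with a value for this attribute.
            donors = [r for j, r in enumerate(records)
                      if j != i and r.get(key) is not None]
            if donors:
                best = max(donors, key=lambda r: similarity(rec, r))
                filled[i][key] = best[key]
    return filled


def fuse(duplicates):
    """Merge duplicate records, keeping each attribute's most frequent value."""
    fused = {}
    for key in duplicates[0]:
        values = [r[key] for r in duplicates if r.get(key) is not None]
        fused[key] = Counter(values).most_common(1)[0][0] if values else None
    return fused


# Hypothetical sample data.
records = [
    {"name": "Ann", "city": "Lahore", "age": 30},
    {"name": "Ann", "city": "Lahore", "age": None},
    {"name": "Bob", "city": "Karachi", "age": 41},
]
cleaned = fill_missing(records)
merged = fuse([
    {"name": "Ann", "city": "Lahore", "age": 30},
    {"name": "Ann", "city": "Lahore", "age": 30},
    {"name": "Ann", "city": "Multan", "age": 30},
])
```

In this sketch the second record takes its age from the first, its closest match, and the fused record keeps the city that occurs most often among the duplicates. A real system would need a similarity measure suited to numeric and free-text attributes and a separate step that decides which records count as duplicates.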
Five classifiers (“Naive Bayes”, “Decision Tree”, “Random Forest”, “Deep Learning”, and “Logistic Regression”) were selected to measure the classification accuracy and the F-measure on the cleaned datasets. The average classification accuracy on both datasets increased by up to 3.00% after cleansing the datasets with the developed system. The F-measure also increased by up to 3.26% compared with data cleansing performed with the other algorithms. The developed system can be extended to resolve other inconsistencies in datasets and can also be evaluated on other datasets.
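For readers unfamiliar with the two evaluation metrics, the following Python sketch shows how classification accuracy and the F-measure are commonly computed from true and predicted labels. It assumes a binary task and the balanced F1 form (the harmonic mean of precision and recall); the abstract does not state which F-measure variant or averaging scheme the thesis used, and the labels below are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true, y_pred, positive=1):
    """Balanced F-measure (F1) for one positive class.

    Assumption: binary F1; the thesis may have used a different variant.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, not results from the thesis.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
acc = accuracy(y_true, y_pred)
f1 = f_measure(y_true, y_pred)
```

Comparing these two numbers for a classifier trained on the raw data and on the cleaned data is one way to quantify the kind of improvement the abstract reports.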