Corroborating Information from Disagreeing Views Using Machine Learning Techniques

Riaz, Tayyeba

URI: http://10.250.8.41:8080/xmlui/handle/123456789/36981

Date: 2022

Abstract:

In this era of big data, a huge amount of heterogeneous data is produced and shared on the internet, making it a central medium for valuable sources of information. Unlike traditional media, data on the web can be published without quality control, which makes it less reliable. Data provided by different sources often conflict because the providers may be noisy, erroneous, or obsolete. Data can also be easily manipulated by bots that create misleading content. This gives rise to a fundamental challenge for data extraction and fusion. This paper proposes an automated solution for finding the truth among conflicting data from different sources by considering website credibility. It takes into account that different sources have varying degrees of reliability. It not only considers several factors about the sources but also returns the true answer from a credible source. This paper identifies seven web credibility categories: Accuracy, Authority, Aesthetics, Professionalism, Popularity, Currency and Quality. Each category has one or more factors contributing to it. A total of 24 factors were used after applying feature reduction to approximately 100 factors identified from prior research. Six supervised learning classifiers were employed: Naïve Bayes, Support Vector Machine, Stochastic Gradient Descent, Neural Network, Decision Trees and Random Forest. Existing solutions focus primarily on finding relevant web pages; they either do not evaluate web pages' credibility and focus on trustworthiness only, or evaluate just two or three of the seven credibility categories. Experiments on the Book-Author dataset show that Random Forest performs best, with an accuracy of 97.45%, precision of 0.975, recall of 0.975 and F-measure of 0.974 when all the categories are used together. This is significantly higher than the baseline method, a Bayesian approach using a single factor that falls under the Authority category, which achieves an accuracy of 87.77%.
Furthermore, experiments using each category separately and in combination show that categories with many factors contribute more to credibility than those with a single factor. These are Professionalism, Popularity and Quality. The importance of the Aesthetics category is also demonstrated experimentally: an accuracy of 93.47% for the Aesthetics category alone shows that it is vital to credibility, which is rarely recognized. However, this study focuses primarily on using all seven categories of web credibility to resolve conflicting data.
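
The idea of resolving conflicting data through source credibility can be illustrated with a minimal sketch. It is not the method of the thesis: it assumes each source already has a credibility score in [0, 1] (in the thesis such a judgment would come from a classifier trained on the 24 factors) and simply picks the claimed value with the highest total credibility. The function name `resolve_conflict`, the source names and the scores are all hypothetical.

```python
from collections import defaultdict


def resolve_conflict(claims, credibility):
    """Return the claimed value backed by the highest total credibility.

    claims:      dict mapping source name -> value claimed by that source
    credibility: dict mapping source name -> score in [0, 1] (assumed given)
    """
    totals = defaultdict(float)
    for source, value in claims.items():
        # Assumption of this sketch: a source with no score contributes 0.
        totals[value] += credibility.get(source, 0.0)
    # On a tie, max() keeps the value that was seen first.
    return max(totals, key=totals.get)


# Hypothetical example: three sites disagree on the author of a book.
claims = {
    "site_a": "J. Smith",
    "site_b": "J. Smith",
    "site_c": "John Smyth",
}
credibility = {"site_a": 0.30, "site_b": 0.25, "site_c": 0.90}

answer = resolve_conflict(claims, credibility)
print(answer)
```

In this example the single highly credible source outweighs the two less credible ones, so the result differs from a plain majority vote.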