Abstract:
In this era of big data, a huge amount of heterogeneous data is produced and shared on the internet,
making it a central medium for valuable sources of information. Unlike traditional media, data on
the web can be published without quality control, which makes it less reliable. Data provided by
different sources often conflicts, which can be due to noisy, erroneous, or obsolete data
providers. It can also be easily manipulated by bots that create misleading data. This
gives rise to a fundamental challenge for data extraction and fusion. This paper proposes an
automated solution for finding the truth among conflicting data from different sources by
considering website credibility. It takes into account that different sources have varying degrees
of reliability. It not only considers several factors about each source but also provides the true
answer from a credible source. This paper identifies seven web credibility categories, namely
Accuracy, Authority, Aesthetics, Professionalism, Popularity, Currency, and Quality. Each
category has one or more factors contributing to it. A total of 24 factors were used after applying
feature reduction to approximately 100 factors identified from prior research. Six supervised
learning classifiers were employed: Naïve Bayes, Support Vector Machine, Stochastic Gradient
Descent, Neural Network, Decision Trees, and Random Forest. Existing solutions focus primarily on finding
relevant web pages, and either focus on trustworthiness alone rather than evaluating web pages’
credibility, or evaluate only two to three of the seven credibility categories. Experiments on the
Book-Author dataset show that Random Forest performs best, with an accuracy of 97.45%, precision of
0.975, recall of 0.975, and F-measure of 0.974 when all the categories are used collectively. This is
significantly higher than the baseline method, which uses a single factor that falls under the
Authority category. The baseline, a Bayesian-based approach, achieves an accuracy of 87.77%.
Furthermore, experiments using each category separately and in combination were
performed, which show that categories with many factors contribute more to credibility than the
ones with a single factor. These are Professionalism, Popularity, and Quality. The importance
of the Aesthetics category is also demonstrated experimentally. An accuracy of 93.47% for the
Aesthetics category alone shows that it is vital to credibility, which is rarely recognized.
However, this study focuses primarily on using all seven web credibility categories to resolve
conflicting data.
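
As a rough illustration of how credibility can be used to resolve conflicting data, the following minimal sketch picks, for each item, the value backed by the highest total source credibility. It is not the method evaluated in this paper: the function name `resolve_conflicts`, the tuple layout, the sum-based aggregation, and the example claims and scores are all assumptions made for illustration. The credibility values stand in for the output of a trained classifier.

```python
from collections import defaultdict


def resolve_conflicts(claims):
    """Pick one value per item, weighting each claim by source credibility.

    claims: iterable of (item, value, credibility) tuples, where credibility
    is a hypothetical score in [0, 1] assigned to the source of the claim.
    Returns a dict mapping item -> (best_value, total_credibility).
    """
    # Sum the credibility of all sources that support each candidate value.
    scores = defaultdict(lambda: defaultdict(float))
    for item, value, credibility in claims:
        scores[item][value] += credibility

    # Keep the candidate value with the highest summed credibility per item.
    resolved = {}
    for item, by_value in scores.items():
        best_value = max(by_value, key=by_value.get)
        resolved[item] = (best_value, by_value[best_value])
    return resolved


# Hypothetical conflicting author claims for two books (illustrative only).
claims = [
    ("book-1", "A. Author", 0.9),
    ("book-1", "B. Writer", 0.4),
    ("book-1", "B. Writer", 0.3),
    ("book-2", "C. Scribe", 0.6),
]
result = resolve_conflicts(claims)
for item, (value, score) in sorted(result.items()):
    print(item, value, round(score, 2))
```

In this sketch a single highly credible source can outweigh several less credible ones, which is one possible design choice; a majority vote or a maximum over sources would behave differently.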