Incorporating Big Data Aspects In Incremental Taxonomy Evolution

Aalijah, Kanwal

URI: http://10.250.8.41:8080/xmlui/handle/123456789/37941

Date: 2020

Abstract:

Big data is a breakthrough technology that has been developed over the years. The term refers to very large data sets that grow at an ever-increasing rate, with volumes reaching the exabyte scale (10^18 bytes), and to highly varied data that is processed at very high velocity. Big data is generated by social media, websites, personal electronics, apps, and similar sources, and its volume currently exceeds the computational capabilities of conventional systems. Data in today's digital world is growing at a rapid pace, and the unstructured text produced by these sources needs to be processed and organized. A taxonomy is an effective way to organize such data. A taxonomy generated for big data should represent the underlying data and its changing themes, and when an existing taxonomy is evolved, it should again reflect the changes that have occurred in the data. There is therefore a need for an incremental taxonomy generation technique that handles fast-arriving streams of documents, arranges them in a hierarchical structure, and, when the next stream of data arrives, evolves the existing hierarchy to accommodate the new data. To cater to fast-arriving big data, the technique needs to run on a parallelization framework, so that the running time of incremental taxonomy generation is reduced and the scalability limitations of current incremental taxonomy generation techniques are addressed. This work presents a technique for incremental taxonomy generation for unstructured text data on the Apache Spark framework. The proposed technique not only generates the taxonomy on a parallelization framework but also incrementally updates the existing taxonomy. The technique was tested against non-incremental taxonomy generation techniques and was found to generate a taxonomy in less time than these existing techniques, which can make taxonomy utilization more effective.
The proposed technique was also tested against incremental taxonomy generation techniques. Experiments showed that it updated an existing taxonomy in considerably less time than the existing incremental techniques: the proposed technique updates the taxonomy in seconds, whereas previous algorithms took minutes to hours. This research also compares two prominent big data environments, i.e. Apache Hadoop and Apache Spark, to investigate which is better suited to a clustering problem such as incremental taxonomy generation. Experiments showed that Apache Spark is faster than Apache Hadoop and better suited to a clustering problem such as taxonomy generation. The proposed technique was also run on different configurations of Apache Spark to find the optimal number of cores for running hierarchical clustering jobs on Apache Spark.