Abstract:
Big data is a breakthrough technology that has been developed over the years. Big data means a very
large set of data that grows at an ever-increasing rate, with volumes on the order of exabytes
(10^18 bytes). It refers to extensively varied data that is processed at very high velocity. Big data
is being generated from social media, websites, personal electronics, apps, etc. This volume of data
currently exceeds the computational capabilities of conventional systems. Unstructured text data
produced from several sources needs to be processed and organized. A taxonomy effectively
organizes this data, which in today’s digital world is growing at a rapid pace.
A taxonomy generated for big data should represent the underlying data and its changing themes.
When an existing taxonomy evolves, it should again reflect the changes that have occurred in the
data. There is a need for an incremental taxonomy generation technique that handles fast-arriving
streams of documents, arranges them in a hierarchical structure, and, on the next input stream,
evolves the existing hierarchy to accommodate the new data. To cater to fast-arriving big data,
the technique needs to run on a parallelization framework so that the running time of incremental
taxonomy generation is reduced and the scalability challenges of current incremental taxonomy
generation techniques are addressed. This work presents
a technique for incremental taxonomy generation for unstructured text data on the Apache Spark
framework. The proposed technique not only generates the taxonomy on a parallelization framework
but also incrementally updates the existing taxonomy. The technique was tested against
non-incremental taxonomy generation techniques. It was found that the proposed technique
generates a taxonomy in less time than existing taxonomy generation techniques, which can make
taxonomy utilization more effective. The proposed technique was also
tested against incremental taxonomy generation techniques. Experiments showed that the proposed
technique updated an existing taxonomy in considerably less time than the existing incremental
taxonomy generation techniques. The proposed technique updates the existing taxonomy in seconds,
whereas previous algorithms took minutes to hours for the same process. This research work also
provides a comparison between two
prominent big data environments, i.e., Apache Hadoop and Apache Spark, so that it could be investigated
which big data environment is better suited for a clustering problem like incremental
taxonomy generation. Experiments showed that Apache Spark is faster and better suited than
Apache Hadoop for a clustering problem like taxonomy generation. The proposed technique was
also run on different configurations of Apache Spark to find the optimal number of cores for
running hierarchical clustering jobs on Apache Spark.