Abstract:
Integrating heterogeneous databases means combining data from different sources. This is a
challenging task because the data model and the representation of data vary across relational
databases. It becomes more complicated when both relational databases, for example SQL
(Structured Query Language) databases, and non-relational databases, for example NoSQL
(Not Only SQL) databases, are involved.
In the past, researchers focused on integrating different relational databases. Nowadays, the
integration of SQL and NoSQL databases has become an important issue because of the
popularity of NoSQL.
Until now, various techniques based on supervised machine learning algorithms have been
introduced to solve the problem of heterogeneous database integration. Each method performs
integration in its own way. We introduce unsupervised machine learning algorithms to perform
the integration. The main idea of this approach is to integrate relational and non-relational
databases, and so improve data efficiency, by using unsupervised machine learning algorithms,
so that the dataset does not need to be trained on or supervised. The proposed approach first
retrieves data from MongoDB and applies clustering to that data by setting centroid values. The
algorithm then represents the clusters in different colors. Each cluster represents a specific
table of the SQL database. We also identify the best-performing machine learning algorithm by
comparing different algorithms on accuracy. We considered only the K-means, spectral,
agglomerative, and mean shift algorithms. To validate the clustering produced by each
algorithm, we used a confusion matrix. The proposed approach has been validated through
multiple case studies.
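
The pipeline described above (cluster the documents, treat each cluster as a candidate table, then check the result with a confusion matrix) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the implementation used in this work: the in-memory `DOCS` list stands in for documents read from MongoDB, documents are encoded by which fields they contain, and a small hand-written K-means with explicitly chosen starting centroids replaces a library implementation.

```python
from collections import Counter

# Hypothetical documents standing in for a MongoDB collection (assumption).
DOCS = [
    {"name": "Ann", "email": "ann@example.com"},
    {"name": "Bob", "email": "bob@example.com", "phone": "555-0100"},
    {"order_id": 1, "total": 20.0, "date": "2020-01-01"},
    {"order_id": 2, "total": 35.5},
    {"name": "Cy", "phone": "555-0101"},
    {"order_id": 3, "total": 12.0, "date": "2020-01-03"},
]
# Known target table of each document; used only for validation, never for clustering.
TRUE_TABLES = ["customer", "customer", "order", "order", "customer", "order"]


def featurize(docs):
    """Encode each document as a 0/1 vector of field presence."""
    fields = sorted({key for doc in docs for key in doc})
    return [[1.0 if f in doc else 0.0 for f in fields] for doc in docs]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, iterations=10):
    """Plain K-means that starts from explicitly chosen centroids."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def confusion_matrix(true_labels, cluster_labels):
    """Count how often each (true table, cluster id) pair occurs."""
    return Counter(zip(true_labels, cluster_labels))


def accuracy(matrix, n):
    """Map each cluster to its majority table and score the agreement."""
    best = {}
    for (_table, cluster), count in matrix.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / n


points = featurize(DOCS)
labels = kmeans(points, centroids=[points[0], points[2]])
matrix = confusion_matrix(TRUE_TABLES, labels)
score = accuracy(matrix, len(DOCS))
```

On this toy input the two clusters coincide with the two tables, so the confusion matrix has only two non-zero cells. Comparing algorithms would amount to replacing `kmeans` with another clustering routine and comparing the resulting `score` values.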
There is therefore a gap between supervised and unsupervised machine learning techniques
for this problem. An unsupervised solution is needed that automatically integrates NoSQL
databases into SQL databases and closes this research gap. We propose such a solution for
integrating relational and non-relational databases with unsupervised machine learning
algorithms. It automatically groups similar data from the non-relational database into
clusters, which can then represent the entities of the relational database.