Abstract:
The Semantic Web is an emerging technology that enables machines to
manipulate web data and produce useful results. RDF is the W3C standard
framework for Semantic Web data annotation. RDF data repositories are
growing both in number and size with each passing day. This continuous
growth of Semantic Web data has motivated the research community to work
on efficient access to and manipulation of RDF data. Jena, Sesame and
Openlink Virtuoso have laid the groundwork for the development of RDF
data management systems. However, these traditional RDF data manage-
ment approaches are centralized and are not able to manage huge volumes
of RDF data. Given the growing volume of RDF data, emerging
RDF tools are based on distributed technologies. This paper contributes
to this domain. It presents MapRQL, a distributed SPARQL
query framework for RDF data processing using Hadoop. We combine two
components of the Hadoop ecosystem: MapReduce and HBase. MapReduce provides
efficient processing of huge volumes of data on commodity hardware, whereas
HBase stores semi-structured data in a scalable way. MapRQL generates
customized MapReduce jobs for SPARQL graph patterns and runs them
over huge volumes of RDF data stored in HBase. MapRQL supports
SPARQL queries with basic graph patterns, basic graph patterns with filter
constraints, union (alternative) graph patterns, and optional graph patterns.
MapRQL is evaluated on the Barton dataset by measuring execution time as
the dataset size is gradually increased up to 50 million RDF triples. SPARQL
queries are also run directly over Hive, using the same dataset, and the query
execution time of MapRQL is compared with that of Hive. The results show
a significant performance gain for MapRQL in executing SPARQL queries.
In future work, indexing can be implemented in MapRQL for efficient
retrieval of RDF data.
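
To make the idea of evaluating a SPARQL basic graph pattern as a MapReduce job more concrete, the following is a minimal in-memory sketch in Python. It is not MapRQL's implementation: it uses plain Python in place of Hadoop and HBase, it assumes a single shared join variable, and the sample triples, prefixes, and function names (match, map_phase, shuffle, reduce_phase, run_bgp) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Toy RDF triples (subject, predicate, object); illustrative data only.
TRIPLES = [
    ("ex:book1", "rdf:type", "ex:Book"),
    ("ex:book1", "ex:title", "Dune"),
    ("ex:book2", "rdf:type", "ex:Book"),
    ("ex:book2", "ex:title", "Emma"),
    ("ex:map1", "rdf:type", "ex:Map"),
    ("ex:map1", "ex:title", "Atlas"),
]

# Basic graph pattern: { ?s rdf:type ex:Book . ?s ex:title ?t }
PATTERNS = [
    ("?s", "rdf:type", "ex:Book"),
    ("?s", "ex:title", "?t"),
]


def match(pattern, triple):
    """Return variable bindings if the triple matches the pattern, else None."""
    bindings = {}
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable repeated within one pattern must bind consistently.
            if bindings.get(term, value) != value:
                return None
            bindings[term] = value
        elif term != value:
            return None
    return bindings


def map_phase(triples, patterns, join_var):
    """Emit (join key, (pattern index, bindings)) for every matching triple."""
    for triple in triples:
        for index, pattern in enumerate(patterns):
            bindings = match(pattern, triple)
            if bindings is not None and join_var in bindings:
                yield bindings[join_var], (index, bindings)


def shuffle(pairs):
    """Group mapper output by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, n_patterns):
    """Join bindings that share a key; drop keys not matched by every pattern."""
    results = []
    for key in sorted(groups):
        partial = [{}]
        for index in range(n_patterns):
            candidates = [b for i, b in groups[key] if i == index]
            partial = [
                {**left, **right}
                for left in partial
                for right in candidates
                if all(left.get(var, val) == val for var, val in right.items())
            ]
        results.extend(partial)
    return results


def run_bgp(triples, patterns, join_var="?s"):
    """Evaluate a basic graph pattern whose triple patterns share join_var."""
    pairs = map_phase(triples, patterns, join_var)
    return reduce_phase(shuffle(pairs), len(patterns))


if __name__ == "__main__":
    for solution in run_bgp(TRIPLES, PATTERNS):
        print(solution)
```

In this sketch the map phase plays the role of triple-pattern matching, and the reduce phase performs the join on the shared variable. A filter constraint could be applied to each joined solution in the reducer, and an optional pattern could be approximated by keeping solutions even when the optional pattern has no candidates. These are design assumptions of the sketch, not statements about how MapRQL handles them.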