Abstract:
The rise of the information technology sector has increased storage requirements
in cloud data centers at an unprecedented pace. According to the EMC Digital Universe
study 2012 [1], global storage reached 2.8 trillion GB and is projected to reach
5,247 GB per user by 2020. Data redundancy is one of the root causes of storage
scarcity because clients upload data without knowing what content is already
available on the server. The Ponemon Institute detected 18 percent redundant data
in its "National Survey on Data Center Outages"
[15]. To resolve this issue, the concept of data deduplication is used, where each
file has a unique hash identifier that changes with the content of the file. If a
client tries to save a duplicate of an existing file, the client receives a pointer
for retrieving the existing file. In this way, data deduplication reduces storage
consumption and identifies redundant copies of the same files stored at data centers.
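To make the hash-and-pointer mechanism concrete, the following is a minimal sketch of file-level deduplication in Python; the in-memory index, the SHA-256 choice, and the function names are illustrative assumptions rather than details of the system described in this study.

import hashlib

# Illustrative in-memory index: content hash -> location of the single stored copy.
# (Assumption: a real data center would persist this index in a database.)
dedup_index = {}

def file_hash(path):
    """Hash the whole file, so the identifier changes whenever its content changes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def store_file(path):
    """Store a file only if its content is new; otherwise point to the existing copy."""
    digest = file_hash(path)
    if digest not in dedup_index:
        dedup_index[digest] = path  # placeholder: copy the bytes to backend storage here
    return digest  # the client keeps this pointer to retrieve the (possibly shared) file

A client who uploads an exact copy of an already stored file thus receives only the existing pointer, which is where the storage saving comes from.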
Therefore, many popular cloud storage vendors such as Amazon, Google, Dropbox,
IBM Cloud, Microsoft Azure, SpiderOak, Wuala, and Mozy have adopted data deduplication. In this
study, we compare the commonly used File-level deduplication with our proposed
Block-level deduplication for cloud data centers. We implemented both deduplication
approaches on a local dataset and demonstrated that the proposed Block-level
deduplication approach achieves 5 percent better results than the File-level
deduplication approach. Furthermore, we expect that performance can be further
improved by considering a larger dataset with more users working in a
similar domain.
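As a rough illustration of why Block-level deduplication can outperform File-level deduplication, the sketch below splits files into fixed-size blocks and compares the deduplicated size of each approach; the 4 KB block size and fixed-size chunking are assumptions made only for illustration, not the parameters used in our experiments.

import hashlib

BLOCK_SIZE = 4 * 1024  # assumed fixed block size, chosen only for illustration

def dedup_ratio(files):
    """Return (file_level_ratio, block_level_ratio), i.e. stored size / original size.

    Assumes at least one non-empty file in `files` (a list of byte strings).
    """
    original = sum(len(f) for f in files)
    # File-level: keep one copy per distinct whole-file hash.
    unique_files = {hashlib.sha256(f).hexdigest(): len(f) for f in files}
    # Block-level: keep one copy per distinct block hash.
    unique_blocks = {}
    for f in files:
        for i in range(0, len(f), BLOCK_SIZE):
            block = f[i:i + BLOCK_SIZE]
            unique_blocks[hashlib.sha256(block).hexdigest()] = len(block)
    return (sum(unique_files.values()) / original,
            sum(unique_blocks.values()) / original)

For example, two files that differ only in their last block are entirely distinct at the file level, yet share all earlier blocks at the block level; this is the kind of redundancy that Block-level deduplication captures and File-level deduplication misses.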