
Authors

Krishna Kant Tiwari

Dr. Qaim Mehdi Rizbi

Abstract

In cloud computing, managing redundant (duplicate) data is a significant challenge because it affects storage efficiency, data integrity, and overall system performance. This study provides a comprehensive review of the current strategies used to detect, manage, and eliminate duplicate data in cloud environments. Duplicate and near-duplicate web pages also cause major problems for web search engines. Duplicate document identification, the process of finding pairs of documents that represent the same entity, is a basic component of data cleansing. To address structural heterogeneity, data must first pass through a preparation step comprising parsing, data transformation, and data standardization. The findings are intended to help cloud service providers and users understand the importance of effective data deduplication strategies for optimizing storage resources and improving cloud computing performance.
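
As an illustration only, the preparation and detection steps named in the abstract might be sketched as follows. This is not the method of the article or of any work it reviews: the function names, the 3-word shingle size, and the 0.5 similarity threshold are assumptions chosen for the example, and Jaccard similarity over word shingles stands in for whatever near-duplicate measure a real system would use.

```python
import hashlib
import re


def standardize(record: str) -> str:
    """Preparation step: lowercase, replace punctuation, collapse whitespace."""
    text = record.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    return " ".join(text.split())


def fingerprint(record: str) -> str:
    """Hash of the standardized record; equal hashes mean exact duplicates."""
    return hashlib.sha256(standardize(record).encode("utf-8")).hexdigest()


def shingles(record: str, k: int = 3) -> set:
    """Set of k-word windows over the standardized record (k is an assumption)."""
    tokens = standardize(record).split()
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def find_duplicates(records, threshold: float = 0.5):
    """Compare every pair; return (exact_pairs, near_pairs) as index tuples.

    The threshold is a hypothetical value, not one taken from the article.
    """
    exact, near = [], []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if fingerprint(records[i]) == fingerprint(records[j]):
                exact.append((i, j))
            elif jaccard(records[i], records[j]) >= threshold:
                near.append((i, j))
    return exact, near


if __name__ == "__main__":
    sample = [
        "Cloud Storage, Inc.",
        "cloud storage inc",
        "Data deduplication saves cloud storage space",
        "Data deduplication saves cloud storage space today",
        "Unrelated text",
    ]
    exact_pairs, near_pairs = find_duplicates(sample)
    print("exact:", exact_pairs)
    print("near:", near_pairs)
```

The pairwise loop is quadratic in the number of records, which is acceptable for a demonstration but would need blocking or indexing at cloud scale.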


