Comparison of Hadoop MapReduce and Apache Spark in Big Data Processing with HGrid247-DE
Abstract
Managing and analyzing big data has become one of the most significant challenges in today's rapidly evolving information technology landscape. This paper compares two major frameworks for big data processing, Hadoop MapReduce and Apache Spark. Both frameworks were tested in three scenarios (sorting, summarizing, and grouping) using HGrid247-DE as the primary data processing tool. Datasets sourced from Kaggle, ranging in size from 3 MB to 260 MB, were used to evaluate the performance of each framework. The findings show that Apache Spark generally outperforms Hadoop MapReduce in processing speed because it handles data in memory. However, Hadoop MapReduce proved more efficient in specific scenarios, particularly for smaller tasks or when memory resources are limited, largely because Apache Spark can incur overhead when initializing tasks for smaller jobs. Furthermore, Hadoop MapReduce's reliance on disk I/O makes it more suitable for datasets that exceed available memory, whereas Spark excels where fast iterative processing and real-time data analysis are essential. This study offers guidance for practitioners and researchers selecting a framework for specific big data processing requirements, particularly with respect to speed, memory usage, and task complexity.
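The three test scenarios named in the abstract can be illustrated with a minimal sketch of the map, shuffle, and reduce phases. This is plain Python, not the HGrid247-DE, Hadoop, or Spark API, and it is not the authors' benchmark code; the records and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `sort_by_total`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (category, amount) records; not taken from the paper's Kaggle datasets.
RECORDS = [
    ("books", 12.0),
    ("games", 30.0),
    ("books", 8.0),
    ("music", 5.0),
    ("games", 10.0),
]


def map_phase(records):
    """Emit (key, value) pairs, as a mapper would."""
    for category, amount in records:
        yield category, amount


def shuffle_phase(pairs):
    """Collect values under their key: the 'grouping' scenario."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups):
    """Fold each group's values into one total: the 'summarizing' scenario."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def sort_by_total(totals):
    """Order keys by descending total: the 'sorting' scenario."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


grouped = shuffle_phase(map_phase(RECORDS))
totals = reduce_phase(grouped)
ranked = sort_by_total(totals)
print(ranked)
```

In this sketch every intermediate result (`grouped`, `totals`) stays in memory between phases, which loosely mirrors Spark's approach; Hadoop MapReduce would instead write the output of each phase to disk before the next phase reads it.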
Copyright (c) 2024 Firmania Dwi Utami, Femi Dwi Astuti
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.