A Comparison Study of Different Type of Data Analysis
A Comparative Study of Performance and Development Complexity in MapReduce and Parallel SQL Database Management Systems
Keywords:
MapReduce, data analysis, parallel SQL database management systems, benchmark, performance, development complexity, parallelism, cluster, trade-offs, implementation concepts
Abstract
There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.
Published
2011-11-01
Issue
Section
Articles