Improving Sort Performance in Apache Spark: It’s a Double | Cloudera Engineering Blog
by Sandy Ryza and Saisai (Jerry) Shao, January 14, 2015

Cloudera and Intel engineers are collaborating to make Spark's shuffle process more scalable and reliable. Here are the details of the approach's design.

What separates computation engines like MapReduce and Apache Spark (the next-generation data processing engine for Apache Hadoop) from embarrassingly parallel systems is their support for "all-to-all" operations. As distributed engines, MapReduce and Spark operate on sub-slices of a dataset partitioned across the cluster. Many operations process single data points at a time and can be carried out fully within each partition. All-to-all operations, by contrast, must consider the dataset as a whole: the contents of each output record can depend on records that come from many different partitions. In Spark, operations such as groupByKey trigger this all-to-all behavior.

In these distributed computation engines, the "shuffle" refers to the repartitioning and aggregation of data during an all-to-all operation. Understandably, most performance and scalability concerns in these engines center on the shuffle.
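To make the repartition-and-aggregate idea concrete, here is a minimal, engine-agnostic sketch of what a hash-based shuffle does conceptually. The function name `shuffle` and the data layout are illustrative, not Spark's actual API: it routes every key-value record to an output partition by hashing its key, so that all records sharing a key end up in the same partition and can be aggregated there.

```python
from collections import defaultdict

def shuffle(partitions, num_output_partitions):
    """Repartition key-value records by key hash, so every record with
    the same key lands in the same output partition."""
    outputs = [defaultdict(list) for _ in range(num_output_partitions)]
    # "Map side": each input partition routes its records by key hash.
    for partition in partitions:
        for key, value in partition:
            dest = hash(key) % num_output_partitions
            outputs[dest][key].append(value)
    # "Reduce side": each output partition now holds all values for its keys
    # and can aggregate them locally (as groupByKey does).
    return [dict(out) for out in outputs]

# Two input partitions, as if the dataset were split across two machines.
parts = [[(1, "x"), (2, "y")], [(1, "z"), (3, "w")]]
shuffled = shuffle(parts, 2)
# All values for key 1 are now grouped in a single output partition.
```

Note that the routing step is what forces data to cross partition (and, in a real cluster, machine) boundaries; this network transfer is why the shuffle dominates performance concerns.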