Spark: Best practice for retrieving big data from RDD to local machine - Stack Overflow
Wildfire's answer seems semantically correct, but I'm sure you could be vastly more efficient by using Spark's API. If you want to process each partition in turn, I don't see why you can't use map/filter/reduce/reduceByKey/mapPartitions operations. The only time you'd want to have everything in one place in one array is when you're going to perform a non-monoidal operation, but that doesn't seem to be what you want. You should be able to do something like:
rdd.mapPartitions(recordsIterator => ...) // your code that processes a single chunk
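For instance, here is a minimal runnable sketch of the mapPartitions route (the local SparkContext, the sample data, and the per-partition sum are assumptions added for illustration, not part of the original answer):

import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("map-partitions-sketch").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 1000000, numSlices = 8)

    // Each partition is reduced to one small value on the executors;
    // only those eight values ever travel back to the driver.
    val perPartitionSums = rdd.mapPartitions { recordsIterator =>
      Iterator.single(recordsIterator.map(_.toLong).sum)
    }

    perPartitionSums.collect().foreach(println)
    sc.stop()
  }
}

Because the heavy work happens inside mapPartitions, the driver only ever sees the aggregated results, which is the point of staying within the API instead of collecting the whole RDD locally.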
Or this:
rdd.foreachPartition(partition => { val chunk = partition.toArray; /* your code */ })
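And a short sketch of the foreachPartition variant, reusing the rdd from the sketch above (the println body is an assumed stand-in for your per-chunk logic):

rdd.foreachPartition { partition =>
  // toArray materializes the whole partition in that executor's memory,
  // so this is only sensible when a single partition fits in RAM.
  val chunk = partition.toArray
  println(s"processing a chunk of ${chunk.length} records")
}

Note that materializing the partition costs executor memory; if your logic can consume the iterator lazily, you can skip the toArray entirely.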