Spark: Best practice for retrieving big data from RDD to local machine - Stack Overflow
Wildfire's answer seems semantically correct, but I'm sure you could be vastly more efficient by using Spark's API. If you want to process each partition in turn, I don't see why you can't use map/filter/reduce/reduceByKey/mapPartitions operations. The only time you'd want to have everything in one place in one array is when you're going to perform a non-monoidal operation, but that doesn't seem to be what you want. You should be able to do something like:
rdd.mapPartitions(recordsIterator => ...) // your code that processes a single chunk
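For instance, here is a minimal runnable sketch of the mapPartitions route (the local SparkContext, the sample data, and the per-partition sum are assumptions added for illustration, not part of the original answer):

import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("map-partitions-sketch").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 1000000, numSlices = 8)

    // Each partition is reduced to one small value on the executors;
    // only those eight values ever travel back to the driver.
    val perPartitionSums = rdd.mapPartitions { recordsIterator =>
      Iterator.single(recordsIterator.map(_.toLong).sum)
    }

    perPartitionSums.collect().foreach(println)
    sc.stop()
  }
}

Because the heavy work happens inside mapPartitions, the driver only ever sees the aggregated results, which is the point of staying within the API instead of collecting the whole RDD locally.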
Or this:
rdd.foreachPartition(partition => { val chunk = partition.toArray; /* your code */ })
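And a short sketch of the foreachPartition variant, reusing the rdd from the sketch above (the println body is an assumed stand-in for your per-chunk logic):

rdd.foreachPartition { partition =>
  // toArray materializes the whole partition in that executor's memory,
  // so this is only sensible when a single partition fits in RAM.
  val chunk = partition.toArray
  println(s"processing a chunk of ${chunk.length} records")
}

Note that materializing the partition costs executor memory; if your logic can consume the iterator lazily, you can skip the toArray entirely.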