Are you running Spark in Local, Standalone, YARN or Mesos mode?
If you're running in Standalone/YARN/Mesos, then the .count() action is indeed automatically parallelized across multiple Executors.
When you call .count() on an RDD, Spark distributes tasks to the executors: each task counts the elements of one local partition, and the sub-counts are then sent back to the driver for final aggregation. That sounds like exactly the behavior you're looking for.
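Here's a minimal sketch of what count() effectively does under the hood (the variable names are illustrative, not Spark internals):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("count-sketch")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 1000000, numSlices = 8) // 8 partitions => 8 tasks

// Roughly equivalent to rdd.count(): each task counts its own partition
// locally, then the driver sums the per-partition sub-counts.
val subCounts = rdd.mapPartitions(iter => Iterator(iter.size.toLong)) // one sub-count per partition
val total = subCounts.collect().sum // driver-side aggregation

println(total) // same result as rdd.count()
```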
However, in Local mode everything runs in a single JVM (the driver and executor combined), so there are no separate Executors to parallelize across. Note that with a master like local[N], tasks can still run concurrently on N threads within that one JVM.
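A quick illustration of the master settings involved; the cluster URL below is a placeholder, not a real host:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Single JVM, 4 worker threads: the tasks behind count() still run in
// parallel, just on threads rather than on separate Executor JVMs.
val localSc = new SparkContext(
  new SparkConf().setMaster("local[4]").setAppName("local-demo"))

println(localSc.parallelize(1 to 100, 4).count()) // 100, counted by 4 parallel tasks
localSc.stop()

// Standalone cluster: tasks are shipped to Executor JVMs on worker nodes.
// (Placeholder master URL; only one SparkContext may be active per JVM.)
// val clusterSc = new SparkContext(
//   new SparkConf().setMaster("spark://master-host:7077").setAppName("cluster-demo"))
```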
Read the full thread on the Apache Spark User List: "Doing RDD.count() in parallel, or at least parallelize it as much as possible?"