Are you running Spark in Local, Standalone, YARN or Mesos mode?
If you're running in Standalone/YARN/Mesos, then the .count() action is indeed automatically parallelized across multiple Executors.
When you call .count() on an RDD, Spark distributes tasks to the executors: each task counts the elements of one local partition, and the sub-counts are then sent back to the driver for final aggregation. That sounds like exactly the behavior you're looking for.
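Here's a minimal sketch of what count() effectively does under the hood (the variable names are illustrative, not Spark internals):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("count-sketch")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 1000000, numSlices = 8) // 8 partitions => 8 tasks

// Roughly equivalent to rdd.count(): each task counts its own partition
// locally, then the driver sums the per-partition sub-counts.
val subCounts = rdd.mapPartitions(iter => Iterator(iter.size.toLong)) // one sub-count per partition
val total = subCounts.collect().sum // driver-side aggregation

println(total) // same result as rdd.count()
```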
However, in Local mode everything runs in a single JVM (the driver and executor combined), so there are no separate Executors to parallelize across. Note that with a master like local[N], tasks can still run concurrently on N threads within that one JVM.
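A quick illustration of the master settings involved; the cluster URL below is a placeholder, not a real host:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Single JVM, 4 worker threads: the tasks behind count() still run in
// parallel, just on threads rather than on separate Executor JVMs.
val localSc = new SparkContext(
  new SparkConf().setMaster("local[4]").setAppName("local-demo"))

println(localSc.parallelize(1 to 100, 4).count()) // 100, counted by 4 parallel tasks
localSc.stop()

// Standalone cluster: tasks are shipped to Executor JVMs on worker nodes.
// (Placeholder master URL; only one SparkContext may be active per JVM.)
// val clusterSc = new SparkContext(
//   new SparkConf().setMaster("spark://master-host:7077").setAppName("cluster-demo"))
```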
Read the full thread on the Apache Spark User List: "Doing RDD.count() in parallel, or at least parallelize it as much as possible?"