All About Programming: How far will Spark RDD cache go?

How far will Spark RDD cache go? - Stack Overflow

The whole idea of cache is that Spark does not keep results in memory unless you tell it to. So if you cache the last RDD in the chain, it keeps only that RDD's results in memory. So yes, you do need to cache them separately. Keep in mind that you only need to cache an RDD if you are going to use it more than once, for example:

    rdd4.cache()
    val v1 = rdd4.lookup("key1")
    val v2 = rdd4.lookup("key2")

If you do not call cache in this case, rdd4 will be recalculated for every call to lookup (or any other function that requires evaluation). You might want to read the paper on RDDs; it is pretty easy to understand and explains the ideas behind certain choices the authors made regarding how RDDs work.
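To make the recomputation visible, here is a minimal sketch meant to be pasted into spark-shell. The local session, the sample data, and the `evaluations` accumulator are assumptions added for illustration; only `rdd4`, `cache` and `lookup` come from the answer itself.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().appName("cache-sketch").master("local[2]").getOrCreate()
val sc = spark.sparkContext

// Assumed helper: counts how many times the map function below actually runs.
val evaluations = sc.longAccumulator("evaluations")

// Hypothetical stand-in for rdd4: a pair RDD produced by a transformation.
val rdd4 = sc.parallelize(Seq("key1" -> 1, "key2" -> 2, "key3" -> 3))
  .map { case (k, v) => evaluations.add(1); (k, v * 10) }

rdd4.cache()                    // only marks rdd4 for caching; nothing is computed yet

val v1 = rdd4.lookup("key1")    // first action: computes rdd4 and stores its partitions
val v2 = rdd4.lookup("key2")    // second action: reads the cached partitions

println(s"v1 = $v1, v2 = $v2, map evaluations = ${evaluations.value}")
```

With the `cache()` call in place, the map function should run once per element across both lookups. Commenting that line out should roughly double the count, since each lookup then re-runs the lineage. Accumulator updates made inside transformations can be repeated when tasks are retried, so treat the counter as a demonstration aid rather than an exact measurement.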

answered Sep 2 '14 at 16:00

aaronman

Appreciate your answer. So whenever there is a fork, you need to cache that RDD to avoid repeated computation. The only pain is having to unpersist the cached RDDs (since I have multiple forks in my RDD transformations). I will read the paper again. Thanks – EdwinGuo Sep 2 '14 at 16:35

@EdwinGuo don't quote me on this, but I think most people find that taking the extra time to unpersist is usually more trouble than it's worth; it's better to let the JVM handle this, as unpersisting is a very expensive operation – aaronman Sep 2 '14 at 16:39

OK, should I open another question about that? I tried searching for unpersist with no luck. The source on GitHub says only "Mark the RDD as non-persistent, and remove all blocks for it from memory and disk." and does not mention much more – EdwinGuo Sep 2 '14 at 16:57

@EdwinGuo if you need to, I would search the spark user group before asking – aaronman Sep 2 '14 at 17:16

Another approach to caching I heard about: more recent versions of Spark may support automatic unpersisting using user-defined priorities, e.g. FIFO – samthebest Sep 3 '14 at 4:36
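For readers following the unpersist discussion in the comments, the sketch below shows one way to cache an RDD at a fork and release it once both branches have run. Whether releasing it explicitly is worth the effort is the commenters' open question, not something this example settles. The RDD name `parent`, the sample data, and the two branches are hypothetical; `cache`, `getStorageLevel` and `unpersist(blocking)` are existing RDD methods.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Local session for illustration; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().appName("unpersist-sketch").master("local[2]").getOrCreate()
val sc = spark.sparkContext

// Hypothetical fork point: one parent RDD feeding two downstream branches.
val parent = sc.parallelize(1 to 1000).map(_ * 2).cache()

val evens = parent.filter(_ % 4 == 0).count()   // first branch: materializes the cache
val total = parent.sum()                        // second branch: reuses the cached partitions

val levelWhileCached = parent.getStorageLevel   // storage level set by cache()

// Both branches are done: mark the RDD as non-persistent and drop its blocks,
// without waiting for the removal to complete.
parent.unpersist(blocking = false)

val levelAfter = parent.getStorageLevel         // reset once unpersist has been called

println(s"evens = $evens, total = $total, before = $levelWhileCached, after = $levelAfter")
```

Passing `blocking = false` asks Spark to remove the blocks asynchronously, so the driver does not wait on the cleanup. If `parent` is used again after this point, it is simply recomputed from its lineage.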
