How far will Spark RDD cache go? - Stack Overflow




The whole idea of cache is that Spark does not keep results in memory unless you tell it to. So if you cache the last RDD in the chain, it keeps only that RDD's results in memory. So yes, you do need to cache them separately. Keep in mind that you only need to cache an RDD if you are going to use it more than once, for example:

rdd4.cache()
val v1 = rdd4.lookup("key1")
val v2 = rdd4.lookup("key2")

If you do not call cache in this case, rdd4 will be recalculated for every call to lookup (or any other function that requires evaluation). You might want to read the paper on RDDs; it is pretty easy to understand and explains the ideas behind certain choices they made regarding how RDDs work.
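A minimal sketch of the situation described above, assuming a local Spark installation (Spark 1.3 or later, so the pair-RDD implicits are in scope). The names rdd1 to rdd3, the sample data, and the local[*] master are assumptions made for illustration; only rdd4, cache and lookup come from the answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheSketch {
  // Builds a four-step chain and caches only the last RDD in it.
  def lookups(sc: SparkContext): (Seq[Int], Seq[Int]) = {
    // Hypothetical sample data and transformations
    val rdd1 = sc.parallelize(Seq("key1" -> 1, "key2" -> 2, "key3" -> 3))
    val rdd2 = rdd1.mapValues(_ * 10)
    val rdd3 = rdd2.filter { case (_, v) => v > 0 }
    val rdd4 = rdd3.mapValues(_ + 1)

    // Marks only rdd4 for storage; rdd1..rdd3 keep StorageLevel.NONE
    rdd4.cache()

    val v1 = rdd4.lookup("key1") // first action evaluates the chain
    val v2 = rdd4.lookup("key2") // can reuse the cached partitions of rdd4

    println(s"rdd3: ${rdd3.getStorageLevel}, rdd4: ${rdd4.getStorageLevel}")
    (v1, v2)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cache-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(lookups(sc)) finally sc.stop()
  }
}
```

Removing the rdd4.cache() line should leave the results unchanged; the difference is that each lookup would then re-run the whole chain from rdd1.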

    
Appreciate your answer. So whenever there is a fork, you need to cache that RDD to avoid repeated computation. The only pain is having to unpersist the cached RDDs (since I have multiple forks in my RDD transformations). I will read the paper again. Thanks –  EdwinGuo Sep 2 '14 at 16:35
    
@EdwinGuo don't quote me on this, but I think most people find that taking the extra time to unpersist is usually more trouble than it's worth. It's better to let the JVM handle this, as unpersisting is a very expensive operation –  aaronman Sep 2 '14 at 16:39
    
OK, should I open another question about that? I tried searching for unpersist with no luck. The description on GitHub, "Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.", does not say much –  EdwinGuo Sep 2 '14 at 16:57
    
@EdwinGuo if you need to, I would search the Spark user group before asking –  aaronman Sep 2 '14 at 17:16
Another approach to caching I heard about is that more recent versions of Spark may support automatic unpersisting using user-defined priorities, e.g. FIFO –  samthebest Sep 3 '14 at 4:36
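The fork case raised in the comments could look roughly like the sketch below: one parent RDD feeds two branches, so it is cached before the first action and released explicitly afterwards. The data, the branch filters, and the names base, evens and big are hypothetical; cache, unpersist(blocking) and getStorageLevel are the RDD methods being illustrated. Whether the explicit unpersist is worth it is the judgment call discussed above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ForkSketch {
  def run(sc: SparkContext): (Long, Long, StorageLevel) = {
    // Shared parent of both branches (hypothetical data: 2, 4, ..., 2000)
    val base = sc.parallelize(1 to 1000).map(_ * 2)
    base.cache()

    // Two branches forking off the same cached parent
    val evens = base.filter(_ % 4 == 0).count()
    val big = base.filter(_ > 1000).count()

    // Marks the RDD as non-persistent and removes its blocks;
    // blocking = false returns without waiting for the blocks to be deleted
    base.unpersist(blocking = false)

    (evens, big, base.getStorageLevel)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("fork-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

Without the explicit call, cached blocks stay until Spark evicts them under memory pressure or the RDD itself is garbage collected on the driver.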
