How far will Spark RDD cache go? - Stack Overflow
The whole idea of cache is that Spark does not keep results in memory unless you tell it to. So if you cache the last RDD in the chain, it only keeps the results of that one in memory. So yes, you do need to cache them separately. Keep in mind that you only need to cache an RDD if you are going to use it more than once, for example when the same RDD (rdd4 here) is queried by several calls to lookup. If you do not call cache in that case, rdd4 will be recalculated for every call to lookup (or any other action that requires evaluation). You might want to read the paper on RDDs; it is pretty easy to understand and explains the ideas behind certain choices they made regarding how RDDs work.
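To make the recomputation point concrete, here is a rough sketch in plain Python. It is a toy model and not Spark's API: the `ToyRDD` class, its `evaluations` counter, and the `build_chain` helper are all made up for illustration, and only the names `cache`, `map`, and `lookup` mirror the real RDD methods. It assumes a chain rdd1 -> rdd2 -> rdd3 -> rdd4 and counts how often each step is actually computed.

```python
class ToyRDD:
    """Toy stand-in for a lazily evaluated RDD (hypothetical, not Spark's API)."""

    def __init__(self, compute, name):
        self._compute = compute        # zero-argument function producing a list
        self.name = name
        self._cache_requested = False
        self._cached = None
        self.evaluations = 0           # how many times this RDD was computed

    @classmethod
    def from_list(cls, data, name):
        return cls(lambda: list(data), name)

    def map(self, fn, name):
        # Lazy: nothing runs until an action calls _materialize().
        return ToyRDD(lambda: [fn(x) for x in self._materialize()], name)

    def cache(self):
        # Only marks *this* RDD; parents are not affected.
        self._cache_requested = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._cache_requested:
            self._cached = result
        return result

    def lookup(self, key):
        # Action: forces evaluation and returns all values stored under key.
        return [v for k, v in self._materialize() if k == key]


def build_chain(cache_last):
    rdd1 = ToyRDD.from_list([1, 2, 3], "rdd1")
    rdd2 = rdd1.map(lambda x: x * 10, "rdd2")
    rdd3 = rdd2.map(lambda x: (x, x + 1), "rdd3")
    rdd4 = rdd3.map(lambda kv: (kv[0], kv[1] * 2), "rdd4")
    if cache_last:
        rdd4.cache()
    return rdd3, rdd4


# Without cache: every lookup on rdd4 recomputes the whole chain.
plain_rdd3, plain_rdd4 = build_chain(cache_last=False)
plain_rdd4.lookup(10)
plain_rdd4.lookup(20)

# With cache on rdd4 only: the chain runs once, later lookups reuse the result.
cached_rdd3, cached_rdd4 = build_chain(cache_last=True)
cached_rdd4.lookup(10)
cached_rdd4.lookup(20)
rdd3_evals_after_rdd4_lookups = cached_rdd3.evaluations

# rdd3 itself was never cached, so using it directly recomputes it each time.
cached_rdd3.lookup(10)
cached_rdd3.lookup(10)
```

In this model the uncached rdd4 is evaluated twice for two lookups, the cached one once, and rdd3 keeps being recomputed because cache was only requested on rdd4. Real Spark adds details this sketch ignores (partitions, storage levels, eviction under memory pressure), so treat it as an illustration of the idea only.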