Spark data persistence
I was asked in an interview yesterday about the difference between Spark's cache and persist, so today I looked into it and am writing down some notes.
The first thing to understand is that RDDs are lazy. Below is a Stack Overflow answer that explains very concretely how to think about RDDs, what difference adding cache makes, and where exactly that difference takes effect.
Most RDD operations are lazy. Think of an RDD as a description of a series of operations. An RDD is not data. So this line:
val textFile = sc.textFile("/user/emp.txt")
It does nothing. It creates an RDD that says “we will need to load this file”. The file is not loaded at this point.
RDD operations that require observing the contents of the data cannot be lazy. (These are called _actions_.) An example is RDD.count — to tell you the number of lines in the file, the file needs to be read. So if you write textFile.count, at this point the file will be read, the lines will be counted, and the count will be returned.
What if you call textFile.count again? The same thing: the file will be read and counted again. Nothing is stored. An RDD is not data.
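A minimal sketch of this recompute-on-every-action behavior, assuming a live SparkContext named sc and reusing the example path from above:

val textFile = sc.textFile("/user/emp.txt")  // lazy: only records "load this file"
val n1 = textFile.count()                    // action: the file is read now and the lines counted
val n2 = textFile.count()                    // action again: the file is read a second time, nothing was kept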
So what does RDD.cache do? If you add textFile.cache to the above code:
val textFile = sc.textFile("/user/emp.txt")
textFile.cache
It does nothing. RDD.cache is also a lazy operation. The file is still not read. But now the RDD says "read this file and then cache the contents". If you then run textFile.count the first time, the file will be loaded, cached, and counted. If you call textFile.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines.
The cache behavior depends on the available memory. If the file does not fit in the memory, for example, then textFile.count will fall back to the usual behavior and re-read the file.
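Putting the cached version of the same flow together, again as a sketch under the same assumptions (a live sc and the example path); the unpersist at the end is only there to show how the cache is released:

val textFile = sc.textFile("/user/emp.txt")
textFile.cache                               // lazy: marks the RDD to be cached, nothing is read yet
val first = textFile.count()                 // reads the file, fills the cache, counts the lines
val second = textFile.count()                // answered from the cache, no file read
textFile.unpersist()                         // drops the cached partitions when no longer needed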
cache() is simply persist() called with no arguments; for an RDD, using cache means taking the default storage level, MEMORY_ONLY.
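In other words, for an RDD the two calls below have the same effect; this is a sketch, with StorageLevel imported from org.apache.spark.storage:

import org.apache.spark.storage.StorageLevel

val a = sc.textFile("/user/emp.txt")
a.cache()                                    // default level

val b = sc.textFile("/user/emp.txt")
b.persist(StorageLevel.MEMORY_ONLY)          // identical effect, with the level spelled out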
The official Spark documentation on which storage level to choose:
http://spark.apache.org/docs/latest/rdd-programming-guide.html#which-storage-level-to-choose
You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
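Since the linked guide is about choosing among several levels, here is a hypothetical sketch of the main alternatives; an RDD can only be given one level, so the commented lines are options, not a sequence:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("/user/emp.txt")
// Pick exactly one level per RDD, e.g.:
//   StorageLevel.MEMORY_ONLY      – what cache() uses; partitions that don't fit are recomputed
//   StorageLevel.MEMORY_ONLY_SER  – stored serialized: less memory, more CPU
//   StorageLevel.MEMORY_AND_DISK  – spill to disk instead of recomputing
//   StorageLevel.DISK_ONLY        – keep everything on disk
//   StorageLevel.MEMORY_ONLY_2    – replicate each partition on two nodes
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()                                  // the first action computes and stores the partitions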