What is the difference between cache() and persist()?
With cache(), only the default storage level MEMORY_ONLY is used.
With persist(), you can specify whichever storage level you need (see the RDD persistence docs).
From the official documentation:
- You can mark an RDD to be persisted using the persist() or cache() methods on it.
- In addition, each persisted RDD can be stored using a different storage level.
- The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).
So use persist() when you need to assign a storage level other than MEMORY_ONLY to the RDD (the docs explain which storage level to choose).
Otherwise there is no difference between the two, as RDD.scala shows:

```scala
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
```
Spark offers several storage levels (MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, and DISK_ONLY, plus the replicated `_2` variants and OFF_HEAP).
cache() always uses MEMORY_ONLY. If you need anything else, use persist(StorageLevel.<*type*>).
By default, persist() will store the data in the JVM heap as deserialized objects.
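The equivalence above can be sketched with a tiny self-contained mock. Note this is not real Spark: `FakeRDD` and the string-based `StorageLevel` object are illustrative stand-ins that mirror the RDD.scala definitions quoted earlier.

```scala
// Illustrative stand-in for org.apache.spark.storage.StorageLevel (not real Spark).
object StorageLevel {
  val MEMORY_ONLY     = "MEMORY_ONLY"
  val MEMORY_AND_DISK = "MEMORY_AND_DISK"
}

// Illustrative stand-in for an RDD, mirroring the RDD.scala snippet above.
class FakeRDD {
  var level: Option[String] = None

  // persist(level): record the requested storage level
  def persist(l: String): this.type = { level = Some(l); this }
  // persist(): defaults to MEMORY_ONLY, exactly as in RDD.scala
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
  // cache(): a pure shorthand for persist()
  def cache(): this.type = persist()
}

val cached    = new FakeRDD().cache()
val persisted = new FakeRDD().persist(StorageLevel.MEMORY_AND_DISK)
println(cached.level.get)    // MEMORY_ONLY
println(persisted.level.get) // MEMORY_AND_DISK
```

In other words, cache() adds no behavior of its own; it simply delegates to persist() with the default level.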