What is the difference between cache and persist ?

Asked on November 14, 2018 in Apache-spark.


  • 3 Answer(s)

    Cache

    cache() always uses the default storage level, which is MEMORY_ONLY.

    Persist

    With persist(), you specify the storage level you need (see RDD Persistence in the Spark documentation).

    From the official documentation:

    • You can mark an RDD to be persisted using the persist() or cache() methods on it.
    • Each persisted RDD can be stored using a different storage level.
    • The cache() method is shorthand for the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).

    Use persist() if you need to assign a storage level other than MEMORY_ONLY to the RDD (see "Which Storage Level to Choose?" in the documentation), as sketched below.
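
    For illustration, a minimal sketch of the two calls side by side; the SparkContext sc and the input file data.txt are assumed here, not part of the original answer:

    import org.apache.spark.storage.StorageLevel

    // cache() pins the RDD at the default level, MEMORY_ONLY
    val cached = sc.textFile("data.txt").cache()

    // persist() lets you choose the level explicitly, e.g. spill
    // partitions that do not fit in memory to disk
    val persisted = sc.textFile("data.txt").persist(StorageLevel.MEMORY_AND_DISK)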

    Answered on November 14, 2018.

    Looking at RDD.scala, there is no difference in the default case:

    /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
    def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
     
    /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
    def cache(): this.type = persist()
    
    Answered on November 14, 2018.

    There are five commonly used storage levels in Spark:

    • MEMORY_ONLY
    • MEMORY_ONLY_SER
    • MEMORY_AND_DISK
    • MEMORY_AND_DISK_SER
    • DISK_ONLY

    cache() uses MEMORY_ONLY. If you need anything else, use persist(StorageLevel.<level>).

    By default, persist() stores the data in the JVM heap as deserialized objects.
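
    As a sketch only (the RDD name rdd is assumed), choosing a serialized level and releasing it afterwards could look like this:

    import org.apache.spark.storage.StorageLevel

    // store partitions serialized in memory, spilling the remainder to disk
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // getStorageLevel reports the level currently assigned to the RDD
    println(rdd.getStorageLevel.description)

    // release the cached blocks when the RDD is no longer needed
    rdd.unpersist()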

    Answered on November 14, 2018.

