How can I find the size of an RDD?

Asked on January 8, 2019 in Apache-spark.


  • 3 Answer(s)

    One approach is to take a sample of the RDD, use SizeEstimator to measure the size of that sample, and scale the result up. If X sampled rows used Y GB, then Z rows at runtime may take roughly Z*Y/X GB.

    The sample Scala code below estimates the size of an RDD this way.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.util.SizeEstimator

    def getTotalSize(rdd: RDD[Row]): Long = {
      // Number of rows to sample; this can be a parameter
      val NO_OF_SAMPLE_ROWS = 10L
      val totalRows = rdd.count()
      if (totalRows > NO_OF_SAMPLE_ROWS) {
        // takeSample returns a fixed number of rows; sample() expects a fraction, not a row count
        val sampleRows = rdd.takeSample(withReplacement = false, num = NO_OF_SAMPLE_ROWS.toInt)
        // Scale the sample size up to the full row count
        getRowsSize(sampleRows) * totalRows / NO_OF_SAMPLE_ROWS
      } else {
        // The RDD is no larger than the sample, so measure every row directly
        getRowsSize(rdd.collect())
      }
    }

    // Sum the estimated in-memory size, in bytes, of each row's values
    def getRowsSize(rows: Array[Row]): Long =
      rows.map(row => SizeEstimator.estimate(row.toSeq.map(_.asInstanceOf[AnyRef]))).sum
    
    Answered on January 8, 2019.

    Persist the RDD, in either deserialized or serialized form, run an action so that it is materialized, and then open the “Storage” page of the Spark UI. It shows the total size of the RDD (memory + disk):

    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    or
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
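
The same figures can probably be read programmatically instead of from the UI. The following is a minimal sketch, assuming a local Spark session; `CachedRddSize` and `cachedSizeBytes` are hypothetical names chosen for illustration, and `SparkContext.getRDDStorageInfo` is marked as a developer API, so its behavior may differ between Spark versions.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachedRddSize {
  // Hypothetical helper: memory + disk bytes of a cached RDD, or None if it is not cached yet
  def cachedSizeBytes(rdd: RDD[_]): Option[Long] =
    rdd.sparkContext.getRDDStorageInfo
      .find(_.id == rdd.id)
      .map(info => info.memSize + info.diskSize)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cached-rdd-size")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()

    val rdd = spark.sparkContext
      .parallelize(1 to 100000)
      .map(i => (i, s"value-$i"))
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Persisting is lazy, so run an action to populate the cache first
    rdd.count()

    println(s"cached size in bytes: ${cachedSizeBytes(rdd)}")
    spark.stop()
  }
}
```

Swapping in `StorageLevel.MEMORY_AND_DISK_SER` should report the serialized size instead, which is usually smaller.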
    

    Calculating the memory size at runtime will not be accurate. You can still estimate it at runtime from size data sampled offline: if X rows used Y GB offline, Z rows at runtime may take about Z*Y/X GB.
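
The Z*Y/X extrapolation can be sketched as a small helper. This is only an illustration: `SizeExtrapolation` and `estimateBytes` are hypothetical names, the sizes are expressed in bytes rather than GB, and it assumes that size grows linearly with the row count.

```scala
object SizeExtrapolation {
  // Linear extrapolation: if sampleRows rows took sampleBytes, targetRows rows take about
  // targetRows * sampleBytes / sampleRows
  def estimateBytes(sampleRows: Long, sampleBytes: Long, targetRows: Long): Long = {
    require(sampleRows > 0, "sampleRows must be positive")
    // BigInt keeps the intermediate product from overflowing Long
    (BigInt(targetRows) * BigInt(sampleBytes) / BigInt(sampleRows)).toLong
  }
}
```

For example, if 1,000 sampled rows took 2,000,000 bytes, `SizeExtrapolation.estimateBytes(1000L, 2000000L, 50000L)` gives 100,000,000 bytes for 50,000 rows.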

    Answered on January 8, 2019.

    The size depends on several factors, including serialization. However, you can take a sample set, run some experiments on that sample data, and extrapolate from there.

     

    Answered on January 8, 2019.

