How to assign unique contiguous numbers to elements in a Spark RDD

Asked on November 23, 2018 in Apache-spark.


  • 3 Answers

    Since Spark 1.0, there are two methods that solve this (a sketch of both follows the list):

    • RDD.zipWithIndex works like Seq.zipWithIndex: it adds contiguous (Long) numbers. It needs to count the elements in each partition first, so the input will be evaluated twice. Cache your input RDD if you need to use it.
    • RDD.zipWithUniqueId also gives unique Long IDs, but they are not guaranteed to be contiguous. (They are contiguous only if each partition has the same number of elements.) The upside is that it does not need to know anything about the input, so it does not cause double evaluation.
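
    A minimal sketch of both methods (the input data and partition count are illustrative assumptions, not from the original answer):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("zip-ids").setMaster("local[2]"))

    // Hypothetical input: four strings spread over two partitions.
    val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)

    // zipWithIndex runs a job to count each partition before assigning
    // the contiguous ids 0..n-1, so cache the input to avoid recomputing it.
    rdd.cache()
    rdd.zipWithIndex().collect()    // Array((a,0), (b,1), (c,2), (d,3))

    // zipWithUniqueId assigns k, n+k, 2n+k, ... within partition k
    // (n = number of partitions), with no extra pass over the data.
    rdd.zipWithUniqueId().collect() // Array((a,0), (b,2), (c,1), (d,3))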

     

    Answered on November 23, 2018.

    For the same use case, another option is to hash the string values, as shown below.

    See: http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/

    // Non-negative hash: mask to the low 23 bits so the id is never negative.
    def nnHash(tag: String) = tag.hashCode & 0x7FFFFF
    // postIDTags is assumed to be an RDD of (postID, tag) pairs.
    val tagHashes = postIDTags.map(_._2).distinct.map(tag => (nnHash(tag), tag))

    Hashing can be easier to manage, but hash collisions mean the resulting IDs are not guaranteed to be unique.
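
    Since hashCode can collide, it may be worth verifying that no two tags share an ID before relying on it (a minimal sketch, reusing tagHashes from above):

    // If two distinct tags map to the same 23-bit hash, these counts differ.
    val distinctTags   = tagHashes.map(_._2).distinct().count()
    val distinctHashes = tagHashes.map(_._1).distinct().count()
    assert(distinctHashes == distinctTags, "hash collision: ids are not unique")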

     

    Answered on November 23, 2018.

    An alternative, if you are using DataFrames and only need uniqueness (not contiguity), is the function monotonically_increasing_id:

    import org.apache.spark.sql.functions.monotonically_increasing_id

    // Adds a column of 64-bit ids that are unique and increasing,
    // but not contiguous.
    val newDf = df.withColumn("uniqueIdColumn", monotonically_increasing_id())

    The camelCase variant monotonicallyIncreasingId was deprecated in Spark 2.0 and removed in later releases; the function is now known as monotonically_increasing_id.
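
    The generated IDs are unique and increasing but not contiguous: the partition ID occupies the upper 31 bits and a per-partition record counter the lower 33 bits. A minimal sketch of this behavior (the data and partition count are illustrative assumptions):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.monotonically_increasing_id

    val spark = SparkSession.builder.appName("mii-demo").master("local[2]").getOrCreate()
    import spark.implicits._

    // Four rows spread over two partitions.
    val df = spark.sparkContext.parallelize(Seq("a", "b", "c", "d"), 2).toDF("value")
    df.withColumn("id", monotonically_increasing_id()).show()
    // Partition 0 rows get ids 0, 1, ...; partition 1 rows start at
    // 1L << 33 = 8589934592, so the overall sequence has gaps.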


    Answered on November 23, 2018.

