How to assign unique contiguous numbers to elements in a Spark RDD
As of Spark 1.0 there are two methods that solve this:
- RDD.zipWithIndex works like Seq.zipWithIndex: it adds contiguous (Long) numbers. It needs to count the elements in each partition first, so the input is evaluated twice. Cache your input RDD if you plan to use it.
- RDD.zipWithUniqueId also gives unique Long IDs, but they are not guaranteed to be contiguous. (They are contiguous only if every partition has the same number of elements.) The upside is that it does not need to know anything about the input, so it does not cause double evaluation.
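The difference between the two schemes can be sketched in plain Scala, with no cluster needed. The `k * n + p` formula below (element `k` of partition `p`, with `n` partitions) follows the scheme described in the `zipWithUniqueId` API docs; the partition contents here are made-up example data:

```scala
// Pretend RDD: 2 partitions of unequal size.
val partitions = List(List("a", "b", "c"), List("d"))
val numPartitions = partitions.length

// zipWithIndex-style: contiguous 0..n-1 across all partitions
// (this is why Spark must count each partition first).
val withIndex: List[(String, Long)] =
  partitions.flatten.zipWithIndex.map { case (x, i) => (x, i.toLong) }

// zipWithUniqueId-style: element k of partition p gets k * numPartitions + p.
// Unique without counting, but contiguous only for equally sized partitions.
val withUniqueId: List[(String, Long)] =
  partitions.zipWithIndex.flatMap { case (part, p) =>
    part.zipWithIndex.map { case (x, k) => (x, k.toLong * numPartitions + p) }
  }
```

Here `withIndex` yields IDs 0, 1, 2, 3, while `withUniqueId` yields 0, 2, 4, 1: all unique, but ID 3 is skipped because the partitions are unequal.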
For the same example use case, you can instead hash the string values:

```scala
def nnHash(tag: String) = tag.hashCode & 0x7FFFFF
val tagHashes = postIDTags.map(_._2).distinct.map(tag => (nnHash(tag), tag))
```

Hashing can be easier to manage, though the IDs are neither contiguous nor guaranteed collision-free.
Alternatively, if you are using DataFrames and only care about uniqueness, you can use the function monotonicallyIncreasingId:
```scala
import org.apache.spark.sql.functions.monotonicallyIncreasingId
val newDf = df.withColumn("uniqueIdColumn", monotonicallyIncreasingId)
```
Note that monotonicallyIncreasingId was deprecated in Spark 2.0 and later removed; use monotonically_increasing_id instead.
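Per the function's documentation, the generated IDs are monotonically increasing and unique but not consecutive: the implementation puts the partition ID in the upper 31 bits and the per-partition record number in the lower 33 bits. A minimal plain-Scala sketch of that bit layout (the helper name `monotonicId` is made up for illustration):

```scala
// How monotonically_increasing_id composes its 64-bit IDs:
// partition ID in the upper 31 bits, record number in the lower 33 bits.
def monotonicId(partitionId: Int, recordNumber: Long): Long =
  (partitionId.toLong << 33) | recordNumber

val firstInPartition0 = monotonicId(0, 0L) // 0
val firstInPartition1 = monotonicId(1, 0L) // 2^33 = 8589934592
```

This is why the IDs jump by large gaps at partition boundaries even though they stay unique and increasing within the DataFrame.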