# In Apache Spark, why does RDD.union not preserve the partitioner ?

In Apache Spark, why does RDD.union not preserve the partitioner ?

Here **union** is powerful operation, because the data is not moved around. If **rdd1** has 10 partitions and **rdd2** has 20 partitions then **rdd1.union(rdd2)** will have 30 partitions: The two RDDs of partitions put after each other. And there is no change.

This disable the partitioner. For a given number of partitions, A partitioner is constructed. The resulting RDD has a number of partitions that is different from both **rdd1** and** rdd2**.

We can run **repartition** to shuffle the data and organize it by key, after taking the union.

In above there is one exception. When there is same partitioner in **rdd1** and **rdd2** (with the same number of partitions), union reacts differently. This will join the partitions of the two RDDs pairwise, giving it the same number of partitions as each of the inputs had.

Here the moving data is involved around (if the partitions were not co-located) but will not involve a shuffle. So the partitioner is retained in such condition. (This gives reference for the code PartitionerAwareUnionRDD.scala.)

`union`

is a very efficient operation, because it doesn’t move any data around. If `rdd1`

has 10 partitions and `rdd2`

has 20 partitions then `rdd1.union(rdd2)`

will have 30 partitions: the partitions of the two RDDs put after each other. This is just a bookkeeping change, there is no shuffle.

But necessarily it discards the partitioner. A partitioner is constructed for a given number of partitions. The resulting RDD has a number of partitions that is different from both `rdd1`

and `rdd2`

.

After taking the union you can run `repartition`

to shuffle the data and organize it by key.

There is one exception to the above. If `rdd1`

and `rdd2`

have the same partitioner (with the same number of partitions), `union`

behaves differently. It will join the partitions of the two RDDs pairwise, giving it the same number of partitions as each of the inputs had. This may involve moving data around (if the partitions were not co-located) but will not involve a shuffle. In this case the partitioner is retained.