How to add a constant column in a Spark DataFrame?

Asked on November 16, 2018 in Apache-spark.


  • 1 Answer(s)

    Spark 2.2+

    Spark 2.2 introduces typedLit, which supports Seq, Map, and Tuples. The following calls should be supported (Scala):

    import org.apache.spark.sql.functions.typedLit
     
    df.withColumn("some_array", typedLit(Seq(1, 2, 3)))
    df.withColumn("some_struct", typedLit(("foo", 1, 0.3)))
    df.withColumn("some_map", typedLit(Map("key1" -> 1, "key2" -> 2)))
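
    To try these calls end to end, here is a minimal sketch that assumes a local SparkSession. The object name, the helper withTypedConstants, the id column, and the sample rows are made up for illustration. withColumn returns a new DataFrame rather than modifying df, so the calls are chained. A struct built from a tuple should get positional field names (_1, _2, _3).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.typedLit

object TypedLitExample {
  // Hypothetical helper: adds three constant columns of complex types.
  def withTypedConstants(df: DataFrame): DataFrame =
    df
      .withColumn("some_array", typedLit(Seq(1, 2, 3)))
      .withColumn("some_struct", typedLit(("foo", 1, 0.3)))
      .withColumn("some_map", typedLit(Map("key1" -> 1, "key2" -> 2)))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("typedLit-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input data.
    val df = Seq(1, 2).toDF("id")

    val result = withTypedConstants(df)
    result.printSchema()            // inspect the inferred column types
    result.show(truncate = false)   // every row carries the same constants

    spark.stop()
  }
}
```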
    

    Spark 1.3+ (lit), 1.4+ (array, struct), 2.0+ (map):

    The second argument to DataFrame.withColumn must be a Column, so a constant has to be wrapped as a literal (Python):

    from pyspark.sql.functions import lit
     
    df.withColumn('new_column', lit(10))
    

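    If you need complex columns, you can build them from blocks like array, map, and struct (Scala):

    import org.apache.spark.sql.functions.{array, lit, map, struct}
     
    df.withColumn("new_column", lit(10))
    df.withColumn("some_array", array(lit(1), lit(2), lit(3)))
    df.withColumn("some_struct", struct(lit("foo"), lit(1), lit(0.3)))
    df.withColumn("map", map(lit("key1"), lit(1), lit("key2"), lit(2)))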
    

    To provide names for the struct fields, use alias on each field:

    df.withColumn(
        "some_struct",
        struct(lit("foo").alias("x"), lit(1).alias("y"), lit(0.3).alias("z"))
    )
    

    or cast the whole object:

    df.withColumn(
        "some_struct",
        struct(lit("foo"), lit(1), lit(0.3)).cast("struct<x: string, y: integer, z: double>")
    )
    

    The same can also be done, albeit more slowly, with a UDF.
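
    For comparison, a rough sketch of the UDF route in Scala. The object name, the helper withConstantViaUdf, and the sample data are hypothetical; the zero-argument UDF simply returns 10 for every row. lit(10) gives the same column without going through a user function, which is why it is generally preferred.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.udf

object ConstantUdfExample {
  // Hypothetical helper: adds "new_column" via a UDF that always returns 10.
  def withConstantViaUdf(df: DataFrame): DataFrame = {
    val constantTen = udf(() => 10)
    df.withColumn("new_column", constantTen())
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("constant-udf-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input data.
    val df = Seq("a", "b").toDF("name")
    withConstantViaUdf(df).show()

    spark.stop()
  }
}
```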

    Note: The same constructs can be used to pass constant arguments to UDFs or SQL functions.
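
    As an illustration of that note, the sketch below passes lit values both to a user-defined function and to the built-in concat. The object name, the helper withDerivedColumns, the addOffset UDF, and the column names are assumptions made for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat, lit, udf}

object ConstantArgumentsExample {
  // Hypothetical helper: uses literals as arguments rather than as whole columns.
  def withDerivedColumns(df: DataFrame): DataFrame = {
    // UDF taking a column value plus a constant offset.
    val addOffset = udf((x: Int, offset: Int) => x + offset)
    df
      .withColumn("shifted", addOffset(col("id"), lit(100)))                // constant passed to a UDF
      .withColumn("label", concat(lit("row_"), col("id").cast("string")))  // constant passed to a SQL function
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("constant-arguments-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input data.
    val df = Seq(1, 2).toDF("id")
    withDerivedColumns(df).show()

    spark.stop()
  }
}
```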

     

    Answered on November 16, 2018.