Apache Spark — Assign the result of UDF to multiple dataframe columns

Asked on January 3, 2019 in Apache-spark.


  • 1 Answer

    You cannot create multiple top-level columns from a single UDF call; the UDF can, however, return a struct, which is then flattened afterwards. This requires a UDF whose returnType is the struct schema:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import *
     
    schema = StructType([
        StructField("foo", FloatType(), False),
        StructField("bar", FloatType(), False)
    ])
    def udf_test(n):
        # Return NaNs for a null input; otherwise split n into its half and its remainder mod 2.
        # (n is falsy for both None and 0.0, so a separate zero check is unnecessary.)
        return (n / 2, n % 2) if n else (float('nan'), float('nan'))
     
    test_udf = udf(udf_test, schema)
    df = sc.parallelize([(1, 2.0), (2, 3.0)]).toDF(["x", "y"])
     
    foobars = df.select(test_udf("y").alias("foobar"))
    foobars.printSchema()
    ## root
    ## |-- foobar: struct (nullable = true)
    ## | |-- foo: float (nullable = false)
    ## | |-- bar: float (nullable = false)
    

    The nested fields can then be pulled out with a simple select:

    foobars.select("foobar.foo", "foobar.bar").show()
    ## +---+---+
    ## |foo|bar|
    ## +---+---+
    ## |1.0|0.0|
    ## |1.5|1.0|
    ## +---+---+
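
    If you want to keep the original columns alongside the UDF's results, the struct can be expanded in one step with the `"struct.*"` wildcard. A minimal self-contained sketch of that pattern, using the modern `SparkSession` entry point instead of `sc` (the names `spark` and `flat` below are illustrative, not from the original answer):

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StructType, StructField, FloatType

    spark = SparkSession.builder.master("local[1]").appName("struct-udf").getOrCreate()

    schema = StructType([
        StructField("foo", FloatType(), False),
        StructField("bar", FloatType(), False),
    ])

    @udf(returnType=schema)
    def udf_test(n):
        # Same logic as above: NaNs for null input, else (half, remainder mod 2).
        return (n / 2, n % 2) if n else (float("nan"), float("nan"))

    df = spark.createDataFrame([(1, 2.0), (2, 3.0)], ["x", "y"])

    # "foobar.*" promotes every field of the struct to a top-level column,
    # so x, foo and bar all appear side by side.
    flat = df.withColumn("foobar", udf_test("y")).select("x", "foobar.*")
    flat.show()
    ```

    This avoids listing each nested field by name, which matters when the struct has many fields.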
    
    Answered on January 3, 2019.

