Renaming column names of a DataFrame in Spark Scala

Asked on November 16, 2018 in Apache-spark.


  • 2 Answer(s)

    For the flat structure:

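    // spark-shell imports spark.implicits._ automatically; elsewhere it is needed for toDF and $"..."
    import spark.implicits._

    val df = Seq((1L, "a", "foo", 3.0)).toDF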
    df.printSchema
    // root
    //   |-- _1: long (nullable = false)
    //   |-- _2: string (nullable = true)
    //   |-- _3: string (nullable = true)
    //   |-- _4: double (nullable = false)
    

    The simplest way to rename every column at once is the toDF method:

    val newNames = Seq("id", "x1", "x2", "x3")
    val dfRenamed = df.toDF(newNames: _*)
    dfRenamed.printSchema
    // root
    //   |-- id: long (nullable = false)
    //   |-- x1: string (nullable = true)
    //   |-- x2: string (nullable = true)
    //   |-- x3: double (nullable = false)
    

    To rename individual columns, use select with alias:

    df.select($"_1".alias("x1"))
    

    This could be easily generalized to multiple columns:

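    import org.apache.spark.sql.functions.col

    // Columns missing from the map keep their original names
    val lookup = Map("_1" -> "foo", "_3" -> "bar")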
     
    df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)
    

    or withColumnRenamed:

    df.withColumnRenamed("_1", "x1")
    

    Combined with foldLeft, it renames multiple columns:

    lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))
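
    One thing to keep in mind with the foldLeft version: withColumnRenamed does nothing when the source column does not exist, so a typo in the map goes unnoticed. A minimal sketch of a stricter variant follows. The helper name renameStrict and the object RenameMany are made up for this illustration, and the check compares names case-sensitively, which is stricter than Spark's default resolution.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RenameMany {
  // Hypothetical helper: same foldLeft rename as above, but it fails fast
  // when the map mentions a column that the DataFrame does not have.
  def renameStrict(df: DataFrame, lookup: Map[String, String]): DataFrame = {
    val missing = lookup.keySet.diff(df.columns.toSet)
    require(missing.isEmpty, s"Columns not found: ${missing.mkString(", ")}")
    lookup.foldLeft(df) { case (acc, (from, to)) => acc.withColumnRenamed(from, to) }
  }

  def main(args: Array[String]): Unit = {
    // Local session only for the sake of a standalone example
    val spark = SparkSession.builder().master("local[1]").appName("rename-many").getOrCreate()
    import spark.implicits._

    val df = Seq((1L, "a", "foo", 3.0)).toDF
    val lookup = Map("_1" -> "foo", "_3" -> "bar")

    renameStrict(df, lookup).printSchema()

    spark.stop()
  }
}
```

    If you are on Spark 3.4 or later, df.withColumnsRenamed(lookup) should do the same rename in a single call, without the check.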
    

    With nested structures (structs) one possible option is renaming by selecting a whole structure:

    val nested = spark.read.json(sc.parallelize(Seq(
        """{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
    )))
     
    nested.printSchema
    // root
    // |-- foobar: struct (nullable = true)
    // | |-- foo: struct (nullable = true)
    // | | |-- bar: struct (nullable = true)
    // | | | |-- first: double (nullable = true)
    // | | | |-- second: double (nullable = true)
    // |-- id: long (nullable = true)
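     
    import org.apache.spark.sql.functions.struct

    // Rebuild the struct level by level, giving each field its new name
    @transient val foobarRenamed = struct(
      struct(
        struct(
          $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.second".as("y")
        ).alias("point")
      ).alias("location")
    ).alias("record")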
     
    nested.select(foobarRenamed, $"id").printSchema
    // root
    // |-- record: struct (nullable = false)
    // | |-- location: struct (nullable = false)
    // | | |-- point: struct (nullable = false)
    // | | | |-- x: double (nullable = true)
    // | | | |-- y: double (nullable = true)
    // |-- id: long (nullable = true)
    

    Note that this approach can change the nullability metadata: the rebuilt structs above are reported as non-nullable. Another option is to rename by casting:

    nested.select($"foobar".cast(
      "struct<location:struct<point:struct<x:double,y:double>>>"
    ).alias("record")).printSchema
     
    // root
    // |-- record: struct (nullable = true)
    // | |-- location: struct (nullable = true)
    // | | |-- point: struct (nullable = true)
    // | | | |-- x: double (nullable = true)
    // | | | |-- y: double (nullable = true)
    

    or:

    import org.apache.spark.sql.types._
     
    nested.select($"foobar".cast(
      StructType(Seq(
        StructField("location", StructType(Seq(
          StructField("point", StructType(Seq(
            StructField("x", DoubleType), StructField("y", DoubleType)))))))))
    ).alias("record")).printSchema
     
    // root
    // |-- record: struct (nullable = true)
    // | |-- location: struct (nullable = true)
    // | | |-- point: struct (nullable = true)
    // | | | |-- x: double (nullable = true)
    // | | | |-- y: double (nullable = true)
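
    The nullability difference between the struct-based and the cast-based rename can also be checked from the schema directly instead of reading printSchema output. The following is only a sketch: the object name NullabilityCheck and the method recordNullability are invented for the example, and it reads the JSON from a Dataset[String], which assumes Spark 2.2 or later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

object NullabilityCheck {
  // Returns the nullable flag of `record` when built with struct(...)
  // and when built with cast(...), in that order.
  def recordNullability(spark: SparkSession): (Boolean, Boolean) = {
    import spark.implicits._

    val nested = spark.read.json(Seq(
      """{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
    ).toDS)

    // Rename by rebuilding the struct
    val viaStruct = nested.select(
      struct(struct(struct(
        $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.second".as("y")
      ).alias("point")).alias("location")).alias("record"))

    // Rename by casting to a struct type with the new field names
    val viaCast = nested.select($"foobar".cast(
      "struct<location:struct<point:struct<x:double,y:double>>>").alias("record"))

    (viaStruct.schema("record").nullable, viaCast.schema("record").nullable)
  }

  def main(args: Array[String]): Unit = {
    // Local session only for the sake of a standalone example
    val spark = SparkSession.builder().master("local[1]").appName("nullability-check").getOrCreate()
    val (structNullable, castNullable) = recordNullability(spark)
    println(s"struct(...): nullable = $structNullable")
    println(s"cast(...):   nullable = $castNullable")
    spark.stop()
  }
}
```

    If nullability matters downstream (for example when writing to a sink with a fixed schema), the cast variant is the one that keeps the flags of the original column.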
    
    Answered on November 16, 2018.

    In PySpark the approach is essentially the same as in Scala:

    merchants_df_renamed = merchants_df.toDF(
        'merchant_id', 'category', 'subcategory', 'merchant')
     
    merchants_df_renamed.printSchema()
    

    The output will be:

    root
     |-- merchant_id: integer (nullable = true)
     |-- category: string (nullable = true)
     |-- subcategory: string (nullable = true)
     |-- merchant: string (nullable = true)

     

    Answered on November 16, 2018.