How to change column types in Spark SQL’s DataFrame?


Asked on November 14, 2018 in Apache-spark.


  • 3 Answer(s)

    Since Spark 1.4, the cast method can be applied with a DataType on the column:

    import org.apache.spark.sql.types.IntegerType

    // Cast into a temporary column, drop the original, then rename back.
    val df2 = df.withColumn("yearTmp", df("year").cast(IntegerType))
        .drop("year")
        .withColumnRenamed("yearTmp", "year")
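
    In Spark 2.x and later, withColumn replaces an existing column with the same name, so the temporary column and rename can be skipped. A minimal sketch, assuming Spark 2.x or later:

    import org.apache.spark.sql.types.IntegerType

    // withColumn with an existing column name overwrites that column in place,
    // so the yearTmp round-trip is not needed.
    val df2 = df.withColumn("year", df("year").cast(IntegerType))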
    

    Using SQL expressions, the same can be done with:

    val df2 = df.selectExpr("cast(year as int) year",
                            "make",
                            "model",
                            "comment",
                            "blank")
    


    Answered on November 14, 2018.

    A Spark DataFrame is an (immutable) RDD of Rows, so a column is never replaced in place; each change produces a new DataFrame with a new schema.

    Suppose you have an original df with the following schema:

    scala> df.printSchema
    root
    |-- Year: string (nullable = true)
    |-- Month: string (nullable = true)
    |-- DayofMonth: string (nullable = true)
    |-- DayOfWeek: string (nullable = true)
    |-- DepDelay: string (nullable = true)
    |-- Distance: string (nullable = true)
    |-- CRSDepTime: string (nullable = true)
    

    Then define UDFs on one or several columns:

    import org.apache.spark.sql.functions._

    // Simple type-conversion UDFs.
    val toInt    = udf[Int, String]( _.toInt)
    val toDouble = udf[Double, String]( _.toDouble)
    // Extract the hour from an HHMM-formatted time string.
    val toHour   = udf((t: String) => "%04d".format(t.toInt).take(2).toInt )
    // A UDF over several columns (the arithmetic here is just a placeholder).
    val days_since_nearest_holidays = udf(
        (year: String, month: String, dayOfMonth: String) => year.toInt + 27 + month.toInt - 12
    )
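
    Note that unlike cast, a udf such as toInt throws (for example a NumberFormatException) on null or malformed input rather than producing null. A null-tolerant variant is a small change; a sketch, assuming a Spark version whose Scala UDFs accept Option return types (2.x and later):

    import scala.util.Try
    import org.apache.spark.sql.functions.udf

    // Option[Int] becomes a nullable integer column:
    // parse failures turn into null, mimicking cast semantics.
    val toIntSafe = udf((s: String) => Try(s.toInt).toOption)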
    

    Changing column types, or even building a new DataFrame from another, can then be written like this:

    val featureDf = df
        .withColumn("departureDelay", toDouble(df("DepDelay")))
        .withColumn("departureHour", toHour(df("CRSDepTime")))
        .withColumn("dayOfWeek", toInt(df("DayOfWeek")))
        .withColumn("dayOfMonth", toInt(df("DayofMonth")))
        .withColumn("month", toInt(df("Month")))
        .withColumn("distance", toDouble(df("Distance")))
        .withColumn("nearestHoliday", days_since_nearest_holidays(
            df("Year"), df("Month"), df("DayofMonth"))
        )
        .select("departureDelay", "departureHour", "dayOfWeek", "dayOfMonth",
                "month", "distance", "nearestHoliday")
    

    which yields:

    scala> featureDf.printSchema
    root
    |-- departureDelay: double (nullable = true)
    |-- departureHour: integer (nullable = true)
    |-- dayOfWeek: integer (nullable = true)
    |-- dayOfMonth: integer (nullable = true)
    |-- month: integer (nullable = true)
    |-- distance: double (nullable = true)
    |-- nearestHoliday: integer (nullable = true)
    

    Keeping the type changes and other transformations as separate udf vals makes the code more readable and reusable.
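
    For instance, the same udf vals can be reused on another DataFrame (a sketch, with otherDf and its TaxiOut column as hypothetical stand-ins):

    val otherFeatures = otherDf.withColumn("taxiOut", toDouble(otherDf("TaxiOut")))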


    Answered on November 14, 2018.

    The cast operation is also available directly on Spark Columns:

    import org.apache.spark.sql.types.IntegerType

    df.select( df("year").cast(IntegerType).as("year"), ... )
    

    It casts to the requested type; as a neat side effect, values that cannot be cast (“converted” in that sense) become null.
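
    For example (a minimal sketch, assuming a SparkSession named spark is in scope for the implicits):

    import spark.implicits._
    import org.apache.spark.sql.types.IntegerType

    // The malformed value survives the cast as null instead of failing the job.
    val demo = Seq("2018", "not-a-year").toDF("year")
    demo.select(demo("year").cast(IntegerType).as("year")).show()
    // +----+
    // |year|
    // +----+
    // |2018|
    // |null|
    // +----+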

    A helper method like this can be handy:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.DataType

    object DFHelper {
        // Cast column `cn` of `df` to type `tpe`, returning a new DataFrame.
        def castColumnTo(df: DataFrame, cn: String, tpe: DataType): DataFrame = {
            df.withColumn(cn, df(cn).cast(tpe))
        }
    }
    

    which is used like this:

    import DFHelper._
    val df2 = castColumnTo( df, "year", IntegerType )
    
    Answered on November 14, 2018.

