Create new Dataframe with empty/null field values

Asked on January 12, 2019 in Apache-spark.
  • 2 Answer(s)

    Here lit(null) can be used:

    import org.apache.spark.sql.functions.{lit, udf}
     
    case class Record(foo: Int, bar: String)
    val df = Seq(Record(1, "foo"), Record(2, "bar")).toDF
     
    val dfWithFoobar = df.withColumn("foobar", lit(null: String))
    

    The problem is that the type of the new column is null:

    scala> dfWithFoobar.printSchema
    root
      |-- foo: integer (nullable = false)
      |-- bar: string (nullable = true)
      |-- foobar: null (nullable = true)
      

    This null type is not retained by the csv writer. If a concrete type is a hard requirement, cast the column explicitly, either with a DataType

    import org.apache.spark.sql.types.StringType
     
    df.withColumn("foobar", lit(null).cast(StringType))
    

    or with a string description

    df.withColumn("foobar", lit(null).cast("string"))
    

    Alternatively, a UDF can be used:

    val getNull = udf(() => None: Option[String]) // Or some other type
     
    df.withColumn("foobar", getNull()).printSchema
    root
      |-- foo: integer (nullable = false)
      |-- bar: string (nullable = true)
      |-- foobar: string (nullable = true)
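
    The same approach covers other column types as well; a minimal sketch, assuming the same df, with a UDF returning Option[Int] instead:

    val getNullInt = udf(() => None: Option[Int]) // null value, integer column type
     
    df.withColumn("foobar", getNullInt()).printSchema
    root
      |-- foo: integer (nullable = false)
      |-- bar: string (nullable = true)
      |-- foobar: integer (nullable = true)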
    

     

    Answered on January 12, 2019.

    You can also use one of the DataFrameNaFunctions.

    Here are two examples:

    val newDf = df.na.replace("your column", Map("" -> "0.0"))
    val newDf2 = df.na.fill("e", Seq("blank"))
    You can also drop entire rows that contain null or empty values, as shown in the sketch below. More ideas for dealing with nulls:

    https://stackoverflow.com/questions/33376571/replace-null-values-in-spark-dataframe
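
    As a rough sketch of how these DataFrameNaFunctions fit together (the data and the "name"/"score" column names are made-up placeholders, and spark.implicits._ is assumed to be in scope):

    val people = Seq(
      (Some("alice"), None),        // null score
      (None, Some(1.0)),            // null name
      (Some(""), Some(2.0))         // empty-string name
    ).toDF("name", "score")
     
    people.na.fill("unknown", Seq("name"))           // fill null names with "unknown"
    people.na.replace("name", Map("" -> "unknown"))  // replace empty-string names with "unknown"
    people.na.drop()                                 // drop rows containing any null value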

    Answered on January 13, 2019.

