Updating a DataFrame column in Spark



  • 3 Answers

    You cannot modify a column in place here; a transformation operates on a column and returns a new DataFrame reflecting that change. First create a UserDefinedFunction implementing the operation to apply, then selectively apply that function to the targeted column only. In Python:

    from pyspark.sql.functions import UserDefinedFunction
    from pyspark.sql.types import StringType

    name = 'target_column'
    # Wrap the operation in a UDF; this one replaces every value with a constant.
    udf = UserDefinedFunction(lambda x: 'new_value', StringType())

    # Apply the UDF to the target column only; all other columns pass through unchanged.
    new_df = old_df.select(*[
        udf(column).alias(name) if column == name else column
        for column in old_df.columns
    ])
    

    new_df now has the same schema as old_df (assuming that old_df.target_column was of type StringType as well), but every value in target_column will be new_value.
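
    If you only need to touch a single column, withColumn is a shorter route to the same result. A minimal sketch, assuming a PySpark version that provides pyspark.sql.functions.udf (the names old_df and target_column carry over from above):

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType

    # withColumn returns a new DataFrame with the named column replaced.
    replace_udf = udf(lambda x: 'new_value', StringType())
    new_df = old_df.withColumn('target_column', replace_udf(col('target_column')))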

    Answered on November 19, 2018.

    Usually, when updating a column, we need to map an old value to a new value.

    To do that in PySpark without UDFs:

    # update df[update_col], mapping old_value --> new_value
    from pyspark.sql import functions as F
    df = df.withColumn(update_col,
        F.when(df[update_col] == old_value, new_value)
         .otherwise(df[update_col]))
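
    As an end-to-end check, here is a runnable sketch of the same when/otherwise pattern; the session setup, column names, and values are illustrative, not part of the original answer:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy data: map the old value 'unknown' to the new value 'pending'.
    df = spark.createDataFrame([('a', 'unknown'), ('b', 'done')], ['id', 'status'])
    df = df.withColumn('status',
        F.when(df['status'] == 'unknown', 'pending')
         .otherwise(df['status']))
    df.show()  # the 'unknown' row now reads 'pending'; 'done' is untouched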
    


    Answered on November 19, 2018.

    DataFrames are built on top of RDDs. RDDs are immutable structures and do not allow elements to be updated in place. To change values, you create a new DataFrame by transforming the original one, either with the SQL-like DSL or with RDD operations such as map.
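
    For completeness, a minimal sketch of that map route; the two-column layout ('id', 'status'), the spark handle, and the replacement value are assumptions for illustration:

    # Rebuild every row positionally, changing only the second field;
    # assumes a DataFrame df with columns ('id', 'status').
    new_rdd = df.rdd.map(lambda row: (row['id'], 'new_value'))
    new_df = spark.createDataFrame(new_rdd, df.schema)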


    Answered on November 19, 2018.

