Updating a DataFrame column in Spark
You cannot modify a column in place, but you can operate on a column and return a new DataFrame reflecting that change. To do so, first create a UserDefinedFunction implementing the operation to apply, then selectively apply that function to the targeted column only. In Python:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

name = 'target_column'
udf = UserDefinedFunction(lambda x: 'new_value', StringType())
new_df = old_df.select(*[
    udf(column).alias(name) if column == name else column
    for column in old_df.columns
])
new_df has the same schema as old_df (assuming that old_df.target_column was of type StringType as well), but every value in column target_column will be 'new_value'.
Usually, when updating a column, we want to map an old value to a new value.
To do that in PySpark without UDFs:
# update df[update_col], mapping old_value --> new_value
from pyspark.sql import functions as F

df = df.withColumn(
    update_col,
    F.when(df[update_col] == old_value, new_value)
     .otherwise(df[update_col])
)
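If several old values need to be mapped at once, DataFrame.replace with a dict may be a shorter alternative. This is a minimal sketch, not the only way to do it: the helper name replace_values and its parameters are made up for illustration, and it assumes a PySpark version where replace accepts a dict without a separate value argument, with old and new values of the same type.

```python
def replace_values(df, column, mapping):
    """Return a new DataFrame with values in `column` replaced per `mapping`.

    `mapping` is a dict of {old_value: new_value}. The input DataFrame is
    left untouched; a new one is returned.
    """
    # subset restricts the replacement to the named column only
    return df.replace(mapping, subset=[column])
```

Usage would look like df = replace_values(df, update_col, {old_value: new_value}). Values not present in the mapping are kept as they are, just like the otherwise branch above.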
DataFrames are built on top of RDDs. RDDs are immutable structures and do not allow updating elements in place. To change values, you create a new DataFrame by transforming the original one, using either the SQL-like DSL or RDD operations such as map.
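For completeness, the RDD route could be sketched roughly as follows. The function names replace_in_row and replace_via_rdd are hypothetical, and the sketch assumes new_value has a type compatible with the target column in the original schema.

```python
def replace_in_row(row, columns, target, old_value, new_value):
    """Return a new tuple with old_value swapped for new_value in the target column."""
    return tuple(
        new_value if (name == target and value == old_value) else value
        for name, value in zip(columns, row)
    )


def replace_via_rdd(df, target, old_value, new_value):
    """Build a new DataFrame by mapping over the underlying RDD of rows."""
    columns = df.columns
    schema = df.schema  # reuse the original schema so column types are preserved
    return df.rdd.map(
        lambda row: replace_in_row(row, columns, target, old_value, new_value)
    ).toDF(schema)
```

Calling new_df = replace_via_rdd(old_df, 'target_column', old_value, new_value) leaves old_df unchanged and returns a separate DataFrame. In most cases the DSL version with when/otherwise is the better choice, since mapping over the RDD pushes every row through Python.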