Spark add new column to dataframe with value from previous row

Asked on January 3, 2019 in Apache-spark.

  • 1 Answer(s)

    The `lag` window function can be used here:

    from pyspark.sql.functions import lag, col
    from pyspark.sql.window import Window
     
    df = sc.parallelize([(4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0)]).toDF(["id", "num"])

    # A window over the whole DataFrame (no partitioning), ordered by id
    w = Window().partitionBy().orderBy(col("id"))

    # lag() yields null for the first row, so na.drop() removes it
    df.select("*", lag("num").over(w).alias("new_col")).na.drop().show()
     
    ## +---+---+-------+
    ## | id|num|new_col|
    ## +---+---+-------+
    ## |  2|3.0|    5.0|
    ## |  3|7.0|    3.0|
    ## |  4|9.0|    7.0|
    ## +---+---+-------+
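
    Instead of dropping the row whose previous value is null, `lag` also accepts a default value as its third argument, so no rows are lost. A minimal self-contained sketch of the same example, using `SparkSession` to build the DataFrame:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lag, col
    from pyspark.sql.window import Window

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame(
        [(4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0)], ["id", "num"]
    )
    w = Window.partitionBy().orderBy(col("id"))

    # lag(col, offset, default): the first row gets 0.0 instead of null,
    # so na.drop() is no longer needed
    res = df.select("*", lag("num", 1, 0.0).over(w).alias("new_col"))
    res.show()
    ```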
    

    Two caveats are important:

    When a global operation is needed (i.e. the window is not partitioned by some other column or columns), all rows are shuffled into a single partition, which is extremely inefficient on large datasets.

    The data needs a column that defines a natural ordering for `orderBy`.
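
    When the data does have a grouping column, partitioning the window by it avoids the single-partition bottleneck, because each group can be processed independently. A sketch, assuming a hypothetical `grp` column (not part of the original example):

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lag, col
    from pyspark.sql.window import Window

    spark = SparkSession.builder.master("local[1]").getOrCreate()

    # Hypothetical data with a grouping column `grp`; lag is computed
    # per group, so rows are not all shuffled to one partition
    df = spark.createDataFrame(
        [("a", 1, 5.0), ("a", 2, 3.0), ("b", 1, 7.0), ("b", 2, 9.0)],
        ["grp", "id", "num"],
    )
    w = Window.partitionBy("grp").orderBy(col("id"))
    result = df.select("*", lag("num").over(w).alias("new_col"))
    result.show()
    ```

    The first row of each group gets a null `new_col`, since there is no previous row within that group.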

    Answered on January 3, 2019.

